Feature/watch #809

DimCitus · 2021-09-17T09:14:35Z

Implement a new pg_autoctl watch command that uses the Ncurses API to display an auto-updating dashboard for pg_auto_failover. The dashboard refreshes every 500ms and reacts to terminal size changes. Depending on the terminal size the output contains a different number of columns.
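As an illustration of how the number of columns can follow the terminal size, here is a minimal sketch that picks the most detailed column set that still fits. The ColumnPolicy type, the policy names, and the width thresholds are all hypothetical and are not taken from watch_colspecs.h.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical column sets, ordered from most to least detailed. */
typedef struct ColumnPolicy
{
	const char *name;   /* label for the policy, for illustration only */
	int minWidth;       /* terminal columns needed to print every column */
} ColumnPolicy;

static const ColumnPolicy policies[] = {
	{ "verbose", 160 },
	{ "standard", 100 },
	{ "compact", 60 },
	{ "minimal", 0 }    /* always fits, acts as the fallback */
};

/* Return the most detailed policy that fits in the given terminal width. */
const ColumnPolicy *
pick_column_policy(int terminalWidth)
{
	size_t count = sizeof(policies) / sizeof(policies[0]);

	assert(terminalWidth >= 0);

	for (size_t i = 0; i < count; i++)
	{
		if (policies[i].minWidth <= terminalWidth)
		{
			return &policies[i];
		}
	}

	/* not reached: the last policy has a minWidth of 0 */
	return &policies[count - 1];
}
```

Calling pick_column_policy() again after each terminal resize is enough to switch between column sets in this sketch.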

The idea is that our users and customers could have an interactive dashboard without needing to build one themselves from the watch(1) command and other utilities.

React quickly to input events (key strokes) while still fetching new data only twice a second.
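One way to get that behaviour is to let the main loop wake up often for input and gate the fetch on elapsed time. The sketch below assumes a monotonic clock in milliseconds; WatchTimer, watch_fetch_is_due and count_fetches are hypothetical names used for illustration, not functions from watch.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FETCH_INTERVAL_MS 500   /* refresh period stated in the description */

typedef struct WatchTimer
{
	bool fetchedOnce;        /* false until the first fetch happened */
	uint64_t lastFetchMs;    /* monotonic clock reading at the last fetch */
} WatchTimer;

/*
 * Called on every loop iteration, typically right after a short input poll.
 * Returns true when it is time to fetch again, and records the fetch time.
 */
bool
watch_fetch_is_due(WatchTimer *timer, uint64_t nowMs)
{
	assert(timer != NULL);

	if (timer->fetchedOnce &&
		nowMs - timer->lastFetchMs < FETCH_INTERVAL_MS)
	{
		return false;
	}

	timer->fetchedOnce = true;
	timer->lastFetchMs = nowMs;

	return true;
}

/* Count the fetches done by a loop that wakes up every tickMs for durationMs. */
int
count_fetches(uint64_t tickMs, uint64_t durationMs)
{
	WatchTimer timer = { false, 0 };
	int fetches = 0;

	assert(tickMs > 0);

	for (uint64_t now = 0; now < durationMs; now += tickMs)
	{
		if (watch_fetch_is_due(&timer, now))
		{
			fetches++;
		}
	}

	return fetches;
}
```

With a 50ms tick the loop gets 40 chances to handle a key stroke over two seconds while fetching only four times. In an Ncurses program the tick would typically come from the getch() timeout set with timeout().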

1. Refrain from showing sub-second intervals, that's distracting
2. Keep showing the "Press F1 to exit", overlaying it if necessary
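For the first point, a minimal sketch of formatting an interval with whole units only could look like this. The format_interval name and the exact output strings are assumptions made for illustration; the actual format used by pg_autoctl watch may differ.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Format an interval given in seconds using whole units only.
 * Returns the value of snprintf(), that is the length of the formatted text.
 */
int
format_interval(double seconds, char *buf, size_t size)
{
	assert(buf != NULL && size > 0);

	if (seconds < 0)
	{
		seconds = 0;
	}

	long total = (long) seconds;    /* truncate: drop the sub-second part */
	long days = total / 86400;
	long hours = (total % 86400) / 3600;
	long mins = (total % 3600) / 60;
	long secs = total % 60;

	if (days > 0)
	{
		return snprintf(buf, size, "%ldd%02ldh", days, hours);
	}
	else if (hours > 0)
	{
		return snprintf(buf, size, "%ldh%02ldm", hours, mins);
	}
	else if (mins > 0)
	{
		return snprintf(buf, size, "%ldm%02lds", mins, secs);
	}

	return snprintf(buf, size, "%lds", secs);
}
```

Showing only the two largest units keeps the column narrow, and it handles intervals of more than a day without a separate code path.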

src/bin/pg_autoctl/monitor.c

src/bin/pg_autoctl/monitor.h

JelteF · 2021-09-29T10:00:56Z

src/bin/pg_autoctl/cli_show.c

 				 "  --formation   formation to query, defaults to 'default' \n"
 				 "  --group       group to query formation, defaults to all \n"
 				 "  --count       how many events to fetch, defaults to 10 \n"
+				 "  --watch       display an auto-updating dashboard\n"


Should this still show the state too? Or only the events. Right now it displays both.

Yeah we have a single watch command and 3 ways to call it. I think I like that we can add --watch to the existing commands and have the new dashboard, even though it's not just show state or show events anymore. I would vote for keeping it that way?

- right-align the footer (Press F1 to exit)
- clean all the rows including the last one
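Right-aligning the footer is mostly index arithmetic, and it is easy to get off by one. The sketch below computes where the text starts, assuming zero-based rows and columns as in Ncurses. FooterPosition and footer_position are hypothetical helpers, not the code of print_watch_footer.

```c
#include <assert.h>
#include <string.h>

typedef struct FooterPosition
{
	int row;    /* zero-based row of the footer */
	int col;    /* zero-based column where the text starts */
	int len;    /* number of characters of the text that fit */
} FooterPosition;

/* Compute where to print a right-aligned footer on a rows x cols screen. */
FooterPosition
footer_position(int rows, int cols, const char *text)
{
	FooterPosition pos = { 0, 0, 0 };
	int len;

	assert(text != NULL && rows > 0 && cols > 0);

	len = (int) strlen(text);

	pos.row = rows - 1;                   /* the last row is rows - 1 */
	pos.len = len < cols ? len : cols;    /* truncate on narrow terminals */
	pos.col = cols - pos.len;             /* text ends in column cols - 1 */

	return pos;
}
```

On a 24x80 terminal the 16 characters of "Press F1 to exit" start at column 64 of row 23 and end in column 79.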

JelteF · 2021-09-29T12:16:53Z

src/bin/pg_autoctl/watch_colspecs.h

+			{ COLUMN_TYPE_CONN_REPORT_LAG, "Report-Lag", 0 },
+			{ COLUMN_TYPE_CONN_HEALTH_LAG, "Health-Lag", 0 },


Maybe use Last Report and Last Health instead of Lag, because that was considered confusing by some people.

Yeah I changed that to “Last Check” and “Last Report” in the long forms, and made the health check column move just before the connection column, then we have the last report column, and only then we have the reported/assigned states. The short forms are now just “Check” and “Report” and it might be explicit enough...

src/bin/pg_autoctl/watch_colspecs.h

JelteF

Suggestions:

Reverse the order of the logs, this will allow us to implement scrolling at some point. It also means that the newest log is close to the state window, so you don't have to look up and down.

src/bin/pg_autoctl/watch.c

JelteF · 2021-09-29T12:43:28Z

src/bin/pg_autoctl/watch.c

+		}
+
+		/* time to finish our connection */
+		pgsql_finish(pgsql);


should we do this in the error cases too?

IIRC any error in the processing of the SQL query will internally close the connection already?

src/bin/pg_autoctl/watch.c

JelteF · 2021-09-29T13:46:34Z

src/bin/pg_autoctl/watch.c

+print_watch_footer(WatchContext *context)
+{
+	int r = context->rows - 1;
+	char *help = "Press F1 to exit";


F1 seems a weird button to use for exit. I think q or esc is fine.

Apparently F1 is common in Ncurses land... or at least in the docs from the previous century. ESC is a can of worms: because of the way terminals encode key presses, ESC followed by a is the same byte sequence as M-a (or Alt+a, depending on how you want to write it). So at the moment we stay away from ESC and Meta / Alt.
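To make the ESC problem concrete, here is a small sketch that classifies raw input bytes. It assumes a terminal that sends Meta / Alt as an ESC prefix, which is a common setting but not a universal one. InputKind and classify_input are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define ESC_BYTE 0x1b

typedef enum InputKind
{
	INPUT_PLAIN,         /* a regular key such as 'q' */
	INPUT_ESC_ALONE,     /* a lone ESC byte */
	INPUT_ESC_PREFIXED   /* ESC plus more bytes: Meta-key, or ESC then a key */
} InputKind;

/* Classify the bytes received in a single read from the terminal. */
InputKind
classify_input(const unsigned char *bytes, size_t len)
{
	assert(bytes != NULL && len > 0);

	if (bytes[0] != ESC_BYTE)
	{
		return INPUT_PLAIN;
	}

	/*
	 * The bytes alone cannot tell "ESC then a" apart from "Meta-a"; only
	 * the timing between the two bytes could hint at the difference.
	 */
	return len == 1 ? INPUT_ESC_ALONE : INPUT_ESC_PREFIXED;
}
```

A single unambiguous key for quitting avoids having to guess from timing.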

While reviewing how we display events, now use a similar policy as for the node states, so that we adjust dynamically to the size of the terminal for the events part too.

To enable that, we switch the terminal back to "cooked" mode where we can read the logs on stderr. When the connection to the monitor could be established again, we switch back to the "raw" mode with the previous settings and continue displaying our dashboard there.

We use a trick to avoid computing the description size, because we will install horizontal scrolling anyway. Avoid printing a column separator on top of the description text.

Avoid redoing the whole events display unless something changed. When we're going to display the same thing that's already visible on-screen anyway, don't bother with doing anything: it's already visible.
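The idea of skipping the redraw can be sketched as a comparison between what was fetched and what is already on screen. The version below assumes that events are identified by their event id and that a change in terminal size also forces a redraw. EventsSnapshot, MAX_EVENTS and events_need_redraw are hypothetical names, not the ones used in watch.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_EVENTS 10

typedef struct EventsSnapshot
{
	int count;                    /* number of valid entries in eventIds */
	long eventIds[MAX_EVENTS];    /* ids of the events, in display order */
	int rows;                     /* terminal size the display was drawn for */
	int cols;
} EventsSnapshot;

/*
 * Return true when the fetched events differ from what is displayed, and
 * remember the fetched snapshot as the new displayed state.
 */
bool
events_need_redraw(EventsSnapshot *displayed, const EventsSnapshot *fetched)
{
	assert(displayed != NULL && fetched != NULL);
	assert(fetched->count >= 0 && fetched->count <= MAX_EVENTS);

	bool same =
		displayed->count == fetched->count &&
		displayed->rows == fetched->rows &&
		displayed->cols == fetched->cols &&
		memcmp(displayed->eventIds, fetched->eventIds,
			   (size_t) fetched->count * sizeof(long)) == 0;

	if (same)
	{
		return false;    /* already visible on screen, nothing to do */
	}

	*displayed = *fetched;

	return true;
}
```

The caller redraws the events window only when the function returns true.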

DimCitus · 2021-09-30T09:53:42Z

Suggestions:

Reverse the order of the logs, this will allow us to implement scrolling at some point. It also means that the newest log is close to the state window, so you don't have to look up and down.

I have done that now, with all the other changes. I am not sure I prefer it that way, but I have added the Event Id to the output to make it obvious what's happening here. It also allows better communication (“can you look at event 98” is easier than spelling out the time/date and then picking one of the entries that happened in that same second).

JelteF

I think this is good to merge, except for one thing:

The "Last Check" column only ever shows somewhere between 1 and 5 seconds for me, even after I took down a node. It still resets back to 1 second even if the postgres node it is checking is down. So either there's a bug here, or the contents of this column are not very useful to show, since they don't seem to mean anything really.

DimCitus · 2021-10-07T15:54:54Z

I think this is good to merge, except for one thing:

The "Last Check" column only ever shows somewhere between 1 and 5 seconds for me, even after I took down a node. It still resets back to 1 second even if the postgres node it is checking is down. So either there's a bug here, or the contents of this column are not very useful to show, since they don't seem to mean anything really.

I think this column is showing the actual monitor's health check worker behaviour and that we should get exposed to that and then see about either improving it, or fixing what we display. At the moment it's not clear to me, but I think we might want to fix the health check worker now that we see how it behaves actually.

DimCitus added the enhancement (New feature or request), user experience, and Size:M (Effort Estimate: Medium) labels Sep 17, 2021

DimCitus added this to the Sprint 2021 W37 W38 milestone Sep 17, 2021

DimCitus requested a review from JelteF September 17, 2021 09:14

DimCitus self-assigned this Sep 17, 2021

DimCitus force-pushed the feature/watch branch from 1733339 to 3bfe625 Compare September 17, 2021 10:18

DimCitus added the Size: XL (Effort Estimate: eXtra Large) label and removed the Size:M (Effort Estimate: Medium) label Sep 23, 2021

DimCitus added 19 commits September 24, 2021 13:41

Implement pg_autoctl watch command.

7cf9d80

The idea is that our users and customers could have an interactive dashboard without needing to build one themselves from the watch(1) command and other utilities.

Organize the display code better, print state from the monitor.

5442c9d

Implement dynamic column policies and display CurrentNodeStates.

507b88c

Add monitor events to the pg_autoctl watch display.

46527a9

Add support for report lag and health check lag to the dashboard.

1853fff

Improve the first line (header).

f418ee1

Assorted improvements.

c5bd605

Add a --watch option to pg_autoctl show state|events.

7e656ee

Add documentation coverage.

eb626ed

Docs improvements.

40eb12f

Add libncurses to the Dockerfile dependencies.

8eac7da

Desultory review of the column policies.

938aa4a

Fix the main watch loop timing.

2193c51

Allow scrolling and highlighting a selected row.

9d7d955

Fix KEY_RIGHT when we already see the end of lines.

9a281ae

Keys u and d are synonym to page-up and page-down.

b46f26a

Allow incremental updates (steps).

33ee46f

React quickly to input events (key strokes) while still fetching new data only twice a second.

Improve horizontal scrolling steps.

47018c6

Per review, enhance the visuals some.

8287bd6

1. Refrain from showing sub-second intervals, that's distracting
2. Keep showing the "Press F1 to exit", overlaying it if necessary

DimCitus force-pushed the feature/watch branch from d22b3e6 to 8287bd6 Compare September 24, 2021 11:41

Quick fix (format string for intervals of more than a day).

5c91ee8

JelteF reviewed Sep 29, 2021

JelteF and others added 2 commits September 29, 2021 12:55

Only call resizeterm when the size is different from before

3f02075

Fix a couple of off-by-ones.

ab4e093

- right-align the footer (Press F1 to exit)
- clean all the rows including the last one

JelteF reviewed Sep 29, 2021

JelteF reviewed Sep 29, 2021

DimCitus added 6 commits September 29, 2021 16:58

Per review, and improvements for Events display.

bc6542a

While reviewing how we display events, now use a similar policy as for the node states, so that we adjust dynamically to the size of the terminal for the events part too.

Per review, fix Control-keys and selected row "jumping".

c49945a

Fix an event rendering bug.

bfdc778

We use a trick to avoid computing the description size, because we will install horizontal scrolling anyway. Avoid printing a column separator on top of the description text.

Add an include for math.h, which somehow macOS didn't require.

8a1013f

Implement "cache invalidation" for the events part of the display.

add8fd0

Avoid redoing the whole events display unless something changed. When we're going to display the same thing that's already visible on-screen anyway, don't bother with doing anything: it's already visible.

DimCitus requested a review from JelteF September 30, 2021 09:52

Add the node names to the events table.

5f77199

DimCitus modified the milestones: Sprint 2021 W37 W38, Sprint 2021 W40 W41 Oct 5, 2021

JelteF suggested changes Oct 7, 2021

JelteF approved these changes Oct 7, 2021

DimCitus merged commit 84576fd into master Oct 7, 2021

DimCitus deleted the feature/watch branch October 7, 2021 16:20
