Skip to content

Conversation

@DimCitus
Copy link
Collaborator

Implement a new pg_autoctl watch command that uses the Ncurses API to display a auto-updating dashboard for pg_auto_failover. The dashboard refreshes every 500ms and reacts to terminal size changes. Depending on the terminal size the output contains a different number of columns.

@DimCitus DimCitus added enhancement New feature or request user experience Size:M Effort Estimate: Medium labels Sep 17, 2021
@DimCitus DimCitus added this to the Sprint 2021 W37 W38 milestone Sep 17, 2021
@DimCitus DimCitus requested a review from JelteF September 17, 2021 09:14
@DimCitus DimCitus self-assigned this Sep 17, 2021
@DimCitus DimCitus added Size: XL Effort Estimate: eXtra Large and removed Size:M Effort Estimate: Medium labels Sep 23, 2021
" --formation formation to query, defaults to 'default' \n"
" --group group to query formation, defaults to all \n"
" --count how many events to fetch, defaults to 10 \n"
" --watch display an auto-updating dashboard\n"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this still show the state too? Or only the events. Right now it displays both.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah we have a single watch command and 3 ways to call it. I think I like that we can add --watch to the existing commands and have the new dashboard, even though it's not just show state or show events anymore. I would vote for keeping it that way?

JelteF and others added 2 commits September 29, 2021 12:55
  - right-align the footer (Press F1 to exit)
  - clean all the rows including the last one
Comment on lines 161 to 162
{ COLUMN_TYPE_CONN_REPORT_LAG, "Report-Lag", 0 },
{ COLUMN_TYPE_CONN_HEALTH_LAG, "Health-Lag", 0 },
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe use Last Report and Last Health instead of Lag, because that was considered confusing by some people.

Suggested change
{ COLUMN_TYPE_CONN_REPORT_LAG, "Report-Lag", 0 },
{ COLUMN_TYPE_CONN_HEALTH_LAG, "Health-Lag", 0 },
{ COLUMN_TYPE_CONN_REPORT_LAG, "Report-Lag", 0 },
{ COLUMN_TYPE_CONN_HEALTH_LAG, "Health-Lag", 0 },

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I changed that to “Last Check” and “Last Report” in the long forms, and made the health check column move just before the connection column, then we have the last report column, and only then we have the reported/assigned states. The short forms are now just “Check” and “Report” and it might be explicit enough...

Copy link
Contributor

@JelteF JelteF left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestions:

  1. Reverse the order of the logs, this will allow us to implement scrolling at some point. It also means that the newest log is close to the state window, so you don't have to look up and down.

}

/* time to finish our connection */
pgsql_finish(pgsql);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we do this in the error cases too?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC any error in the processing of the SQL query will internally close the connection already?

print_watch_footer(WatchContext *context)
{
int r = context->rows - 1;
char *help = "Press F1 to exit";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

F1 seems a weird button to use for exit. I think q or esc is fine.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apparently F1 is common in Ncurses land... or at least in the docs from previous century. ESC is a can of worms, because of the way a qwerty keyboard is wired, ESC then a is the same thing as M-a (or Alt+a depending on how you want to write it). So at the moment we stay away from ESC and Meta / Alt.

While review how we display events, now use a similar policy as for the node
states, so that we adjust dynamically to the size of the terminal for the
events part too.
To enable that, we switch the terminal back to "cooked" mode where we can
read the logs on stderr. When the connection to the monitor could be
established again, we switch back to the "raw" mode with the previous
settings and continue displaying our dashboard there.
We trick to avoid computing the description size, because we will install
horizontal scrolling anyway. Avoid printing a column separator on-top of the
description text...
Avoid redoing the whole events display unless something changed. When we're
going to display the same thing that's already visible on-screen anyway,
don't bother with doing anything: it's already visible.
@DimCitus DimCitus requested a review from JelteF September 30, 2021 09:52
@DimCitus
Copy link
Collaborator Author

Suggestions:

  1. Reverse the order of the logs, this will allow us to implement scrolling at some point. It also means that the newest log is close to the state window, so you don't have to look up and down.

I have done that now, with all the other changes. I am not sure I prefer it that way, but I have added the Event Id in the output to make it obvious that what's happening here. Also allows better communication (“can you look at event 98” makes it easier than spelling out the time/date and then picking one of the entries that happened in that same second).

Copy link
Contributor

@JelteF JelteF left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is good to merge, except for one thing:

The "Last Check" column will only show somewhere between 1 or 5 seconds for me. Even if I took down a node. It will still reset back to 1 second even if the postgres node it is checking is down. So either there's a bug here or at least the contents of this column are not very useful to show. Since they don't seem to mean anything really.

@DimCitus
Copy link
Collaborator Author

DimCitus commented Oct 7, 2021

I think this is good to merge, except for one thing:

The "Last Check" column will only show somewhere between 1 or 5 seconds for me. Even if I took down a node. It will still reset back to 1 second even if the postgres node it is checking is down. So either there's a bug here or at least the contents of this column are not very useful to show. Since they don't seem to mean anything really.

I think this column is showing the actual monitor's health check worker behaviour and that we should get exposed to that and then see about either improving it, or fixing what we display. At the moment it's not clear to me, but I think we might want to fix the health check worker now that we see how it behaves actually.

@DimCitus DimCitus merged commit 84576fd into master Oct 7, 2021
@DimCitus DimCitus deleted the feature/watch branch October 7, 2021 16:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request Size: XL Effort Estimate: eXtra Large user experience

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants