-
Notifications
You must be signed in to change notification settings - Fork 128
Feature/watch #809
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature/watch #809
Conversation
1733339 to
3bfe625
Compare
The idea is that our users and customers could have an interactive dashboard without needing to build one themselves from the watch(1) command and other utilities.
React quickly to input events (key strokes) while still fetching new data only twice a second.
1. Refrain from showing sub-seconds intervals, that's distracting 2. Keep showing the "Press F1 to exit", overlaying it if necessary
d22b3e6 to
8287bd6
Compare
| " --formation formation to query, defaults to 'default' \n" | ||
| " --group group to query formation, defaults to all \n" | ||
| " --count how many events to fetch, defaults to 10 \n" | ||
| " --watch display an auto-updating dashboard\n" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this still show the state too? Or only the events. Right now it displays both.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah we have a single watch command and 3 ways to call it. I think I like that we can add --watch to the existing commands and have the new dashboard, even though it's not just show state or show events anymore. I would vote for keeping it that way?
- right-align the footer (Press F1 to exit) - clean all the rows including the last one
src/bin/pg_autoctl/watch_colspecs.h
Outdated
| { COLUMN_TYPE_CONN_REPORT_LAG, "Report-Lag", 0 }, | ||
| { COLUMN_TYPE_CONN_HEALTH_LAG, "Health-Lag", 0 }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe use Last Report and Last Health instead of Lag, because that was considered confusing by some people.
| { COLUMN_TYPE_CONN_REPORT_LAG, "Report-Lag", 0 }, | |
| { COLUMN_TYPE_CONN_HEALTH_LAG, "Health-Lag", 0 }, | |
| { COLUMN_TYPE_CONN_REPORT_LAG, "Report-Lag", 0 }, | |
| { COLUMN_TYPE_CONN_HEALTH_LAG, "Health-Lag", 0 }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I changed that to “Last Check” and “Last Report” in the long forms, and made the health check column move just before the connection column, then we have the last report column, and only then we have the reported/assigned states. The short forms are now just “Check” and “Report” and it might be explicit enough...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggestions:
- Reverse the order of the logs, this will allow us to implement scrolling at some point. It also means that the newest log is close to the state window, so you don't have to look up and down.
src/bin/pg_autoctl/watch.c
Outdated
| } | ||
|
|
||
| /* time to finish our connection */ | ||
| pgsql_finish(pgsql); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we do this in the error cases too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC any error in the processing of the SQL query will internally close the connection already?
| print_watch_footer(WatchContext *context) | ||
| { | ||
| int r = context->rows - 1; | ||
| char *help = "Press F1 to exit"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
F1 seems a weird button to use for exit. I think q or esc is fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apparently F1 is common in Ncurses land... or at least in the docs from previous century. ESC is a can of worms, because of the way a qwerty keyboard is wired, ESC then a is the same thing as M-a (or Alt+a depending on how you want to write it). So at the moment we stay away from ESC and Meta / Alt.
While review how we display events, now use a similar policy as for the node states, so that we adjust dynamically to the size of the terminal for the events part too.
To enable that, we switch the terminal back to "cooked" mode where we can read the logs on stderr. When the connection to the monitor could be established again, we switch back to the "raw" mode with the previous settings and continue displaying our dashboard there.
We trick to avoid computing the description size, because we will install horizontal scrolling anyway. Avoid printing a column separator on-top of the description text...
Avoid redoing the whole events display unless something changed. When we're going to display the same thing that's already visible on-screen anyway, don't bother with doing anything: it's already visible.
I have done that now, with all the other changes. I am not sure I prefer it that way, but I have added the Event Id in the output to make it obvious that what's happening here. Also allows better communication (“can you look at event 98” makes it easier than spelling out the time/date and then picking one of the entries that happened in that same second). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is good to merge, except for one thing:
The "Last Check" column will only show somewhere between 1 or 5 seconds for me. Even if I took down a node. It will still reset back to 1 second even if the postgres node it is checking is down. So either there's a bug here or at least the contents of this column are not very useful to show. Since they don't seem to mean anything really.
I think this column is showing the actual monitor's health check worker behaviour and that we should get exposed to that and then see about either improving it, or fixing what we display. At the moment it's not clear to me, but I think we might want to fix the health check worker now that we see how it behaves actually. |
Implement a new
pg_autoctl watchcommand that uses the Ncurses API to display a auto-updating dashboard for pg_auto_failover. The dashboard refreshes every 500ms and reacts to terminal size changes. Depending on the terminal size the output contains a different number of columns.