
Conversation

@DimCitus DimCitus (Collaborator) commented May 7, 2021

Fixes #700
Fixes #701

@DimCitus DimCitus added the enhancement (New feature or request) and Size: S (Effort Estimate: Small) labels May 7, 2021
@DimCitus DimCitus added this to the Sprint 2021 W18 W19 milestone May 7, 2021
@DimCitus DimCitus requested a review from JelteF May 7, 2021 11:20
@DimCitus DimCitus self-assigned this May 7, 2021
@JelteF JelteF (Contributor) commented May 19, 2021

When trying this locally using make cluster, the monitor Postgres crashes for me.

I ran make cluster like this:

make cluster -j20 TMUX_LAYOUT=tiled NODES=3

The output looks like this once the beginning of the setup completes:

17:40:17 3789 INFO  New state for node 3 "node3" (localhost:5503): wait_standby ➜ catchingup
17:40:17 3789 INFO  Setting goal state of node 2 "node2" (localhost:5502) to catchingup after node 1 "node1" (localhost:5501) converged to wait_primary.
17:40:17 3789 INFO  New state for node 2 "node2" (localhost:5502): wait_standby ➜ catchingup
17:40:19 3789 WARN  WARNING:  terminating connection because of crash of another server process
17:40:19 3789 WARN  DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
17:40:19 3789 WARN  HINT:  In a moment you should be able to reconnect to the database and repeat your command.
17:40:19 3789 ERROR Failed to LISTEN "state": server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.

17:40:19 3789 WARN  Re-establishing connection. We might miss notifications.
17:40:19 3789 ERROR Connection to database failed: FATAL:  the database system is in recovery mode
17:40:19 3789 ERROR FATAL:  the database system is in recovery mode
17:40:19 3789 ERROR Failed to connect to local database at "port=5500 dbname=pg_auto_failover host=localhost user=autoctl_node", see above for details
17:40:19 3789 WARN  Re-establishing connection. We might miss notifications.
17:40:19 3789 ERROR Connection to database failed: FATAL:  the database system is in recovery mode
17:40:19 3789 ERROR FATAL:  the database system is in recovery mode
17:40:19 3789 ERROR Failed to connect to local database at "port=5500 dbname=pg_auto_failover host=localhost user=autoctl_node", see above for details
17:40:19 3789 WARN  Re-establishing connection. We might miss notifications.
17:40:19 3789 ERROR Connection to database failed: FATAL:  the database system is in recovery mode
17:40:19 3789 ERROR FATAL:  the database system is in recovery mode
17:40:19 3789 ERROR Failed to connect to local database at "port=5500 dbname=pg_auto_failover host=localhost user=autoctl_node", see above for details
17:40:19 3789 WARN  Re-establishing connection. We might miss notifications.
17:40:19 3789 ERROR Connection to database failed: FATAL:  the database system is in recovery mode
17:40:19 3789 ERROR FATAL:  the database system is in recovery mode
17:40:19 3789 ERROR Failed to connect to local database at "port=5500 dbname=pg_auto_failover host=localhost user=autoctl_node", see above for details
17:40:19 3789 WARN  Re-establishing connection. We might miss notifications.
17:40:19 3789 ERROR Connection to database failed: FATAL:  the database system is in recovery mode
17:40:19 3789 ERROR FATAL:  the database system is in recovery mode
17:40:19 3789 ERROR Failed to connect to local database at "port=5500 dbname=pg_auto_failover host=localhost user=autoctl_node", see above for details
....

@DimCitus DimCitus force-pushed the fix/monitor-health-check-events branch from b9a99ce to 52a3f58 on May 19, 2021 16:23
@DimCitus DimCitus (Collaborator, Author) commented:

As per our private messages, it appears that the problem is:

TRAP: FailedAssertion("!(spiStatus == 9)", File: "health_check_metadata.c", Line: 213)
2021-05-19 18:00:22.878 CEST [4611] LOG:  background worker "pg_auto_failover monitor worker" (PID 4631) was terminated by signal 6: Aborted
2021-05-19 18:00:22.878 CEST [4611] DETAIL:  Failed process was running: UPDATE pgautofailover.node   SET health = 1, healthchecktime = now()  WHERE nodeid = 1    AND nodehost = 'localhost' AND nodeport = 5501  RETURNING node.*

I have updated the PR now:

modified   src/monitor/health_check_metadata.c
@@ -210,7 +210,7 @@ SetNodeHealthState(int nodeId,
 		pgstat_report_activity(STATE_RUNNING, query.data);
 
 		spiStatus = SPI_execute(query.data, false, 0);
-		Assert(spiStatus == SPI_OK_UPDATE);
+		Assert(spiStatus == SPI_OK_UPDATE_RETURNING);
 
 		if (healthState != previousHealthState)
 		{

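For context, the UPDATE in SetNodeHealthState ends with a RETURNING clause, so SPI_execute() reports SPI_OK_UPDATE_RETURNING rather than SPI_OK_UPDATE (value 9, which is what the failed assertion was checking). Below is a minimal sketch of that pattern; it is not the actual pg_auto_failover source, and the function name and query text are only illustrative:

/*
 * Minimal sketch (assumed names, not the pg_auto_failover source): an UPDATE
 * carrying a RETURNING clause makes SPI_execute() report
 * SPI_OK_UPDATE_RETURNING instead of SPI_OK_UPDATE, hence the assertion change.
 *
 * Must run inside an SPI session (between SPI_connect() and SPI_finish()).
 */
#include "postgres.h"
#include "executor/spi.h"
#include "lib/stringinfo.h"

static void
update_node_health_sketch(int nodeId, int healthState)
{
	StringInfoData query;

	initStringInfo(&query);
	appendStringInfo(&query,
					 "UPDATE pgautofailover.node"
					 "   SET health = %d, healthchecktime = now()"
					 " WHERE nodeid = %d"
					 " RETURNING node.*",
					 healthState, nodeId);

	/* read_only = false, tcount = 0 means no row-count limit */
	int spiStatus = SPI_execute(query.data, false, 0);

	/*
	 * Because of the RETURNING clause the success code is
	 * SPI_OK_UPDATE_RETURNING; a plain UPDATE would yield SPI_OK_UPDATE.
	 */
	if (spiStatus != SPI_OK_UPDATE_RETURNING)
	{
		elog(ERROR, "health update failed with SPI status %d", spiStatus);
	}

	pfree(query.data);
}
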
@DimCitus DimCitus merged commit d945ca1 into master May 20, 2021
@DimCitus DimCitus deleted the fix/monitor-health-check-events branch May 20, 2021 08:44

Development

Successfully merging this pull request may close these issues.

Monitor stdout doesn't show events about nodes becoming unhealthy
Background worker doesn't know about nodenames

3 participants