
Conversation

@DimCitus DimCitus (Collaborator) commented May 7, 2021

Fixes #700
Fixes #701

@DimCitus DimCitus added the enhancement (New feature or request) and Size: S (Effort Estimate: Small) labels May 7, 2021
@DimCitus DimCitus added this to the Sprint 2021 W18 W19 milestone May 7, 2021
@DimCitus DimCitus requested a review from JelteF May 7, 2021 11:20
@DimCitus DimCitus self-assigned this May 7, 2021
@JelteF JelteF (Contributor) commented May 19, 2021

When trying this locally using make cluster, the monitor Postgres crashes for me.

I ran make cluster like this:

make cluster -j20 TMUX_LAYOUT=tiled NODES=3

The output looks like this once the beginning of the setup completes:

17:40:17 3789 INFO  New state for node 3 "node3" (localhost:5503): wait_standby ➜ catchingup
17:40:17 3789 INFO  Setting goal state of node 2 "node2" (localhost:5502) to catchingup after node 1 "node1" (localhost:5501) converged to wait_primary.
17:40:17 3789 INFO  New state for node 2 "node2" (localhost:5502): wait_standby ➜ catchingup
17:40:19 3789 WARN  WARNING:  terminating connection because of crash of another server process
17:40:19 3789 WARN  DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
17:40:19 3789 WARN  HINT:  In a moment you should be able to reconnect to the database and repeat your command.
17:40:19 3789 ERROR Failed to LISTEN "state": server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.

17:40:19 3789 WARN  Re-establishing connection. We might miss notifications.
17:40:19 3789 ERROR Connection to database failed: FATAL:  the database system is in recovery mode
17:40:19 3789 ERROR FATAL:  the database system is in recovery mode
17:40:19 3789 ERROR Failed to connect to local database at "port=5500 dbname=pg_auto_failover host=localhost user=autoctl_node", see above for details
17:40:19 3789 WARN  Re-establishing connection. We might miss notifications.
17:40:19 3789 ERROR Connection to database failed: FATAL:  the database system is in recovery mode
17:40:19 3789 ERROR FATAL:  the database system is in recovery mode
17:40:19 3789 ERROR Failed to connect to local database at "port=5500 dbname=pg_auto_failover host=localhost user=autoctl_node", see above for details
17:40:19 3789 WARN  Re-establishing connection. We might miss notifications.
17:40:19 3789 ERROR Connection to database failed: FATAL:  the database system is in recovery mode
17:40:19 3789 ERROR FATAL:  the database system is in recovery mode
17:40:19 3789 ERROR Failed to connect to local database at "port=5500 dbname=pg_auto_failover host=localhost user=autoctl_node", see above for details
17:40:19 3789 WARN  Re-establishing connection. We might miss notifications.
17:40:19 3789 ERROR Connection to database failed: FATAL:  the database system is in recovery mode
17:40:19 3789 ERROR FATAL:  the database system is in recovery mode
17:40:19 3789 ERROR Failed to connect to local database at "port=5500 dbname=pg_auto_failover host=localhost user=autoctl_node", see above for details
17:40:19 3789 WARN  Re-establishing connection. We might miss notifications.
17:40:19 3789 ERROR Connection to database failed: FATAL:  the database system is in recovery mode
17:40:19 3789 ERROR FATAL:  the database system is in recovery mode
17:40:19 3789 ERROR Failed to connect to local database at "port=5500 dbname=pg_auto_failover host=localhost user=autoctl_node", see above for details
....

@DimCitus DimCitus force-pushed the fix/monitor-health-check-events branch from b9a99ce to 52a3f58 on May 19, 2021 16:23
@DimCitus DimCitus (Collaborator, Author) commented:

As per our private messages, it appears that the problem is:

TRAP: FailedAssertion("!(spiStatus == 9)", File: "health_check_metadata.c", Line: 213)
2021-05-19 18:00:22.878 CEST [4611] LOG:  background worker "pg_auto_failover monitor worker" (PID 4631) was terminated by signal 6: Aborted
2021-05-19 18:00:22.878 CEST [4611] DETAIL:  Failed process was running: UPDATE pgautofailover.node   SET health = 1, healthchecktime = now()  WHERE nodeid = 1    AND nodehost = 'localhost' AND nodeport = 5501  RETURNING node.*

I have updated the PR now:

modified   src/monitor/health_check_metadata.c
@@ -210,7 +210,7 @@ SetNodeHealthState(int nodeId,
 		pgstat_report_activity(STATE_RUNNING, query.data);
 
 		spiStatus = SPI_execute(query.data, false, 0);
-		Assert(spiStatus == SPI_OK_UPDATE);
+		Assert(spiStatus == SPI_OK_UPDATE_RETURNING);
 
 		if (healthState != previousHealthState)
 		{

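For context, the UPDATE in SetNodeHealthState ends with a RETURNING clause, so SPI_execute() reports SPI_OK_UPDATE_RETURNING rather than SPI_OK_UPDATE (value 9, which is what the failed assertion was checking). Below is a minimal sketch of that pattern; it is not the actual pg_auto_failover source, and the function name and query text are only illustrative:

/*
 * Minimal sketch (assumed names, not the pg_auto_failover source): an UPDATE
 * carrying a RETURNING clause makes SPI_execute() report
 * SPI_OK_UPDATE_RETURNING instead of SPI_OK_UPDATE, hence the assertion change.
 *
 * Must run inside an SPI session (between SPI_connect() and SPI_finish()).
 */
#include "postgres.h"
#include "executor/spi.h"
#include "lib/stringinfo.h"

static void
update_node_health_sketch(int nodeId, int healthState)
{
	StringInfoData query;

	initStringInfo(&query);
	appendStringInfo(&query,
					 "UPDATE pgautofailover.node"
					 "   SET health = %d, healthchecktime = now()"
					 " WHERE nodeid = %d"
					 " RETURNING node.*",
					 healthState, nodeId);

	/* read_only = false, tcount = 0 means no row-count limit */
	int spiStatus = SPI_execute(query.data, false, 0);

	/*
	 * Because of the RETURNING clause the success code is
	 * SPI_OK_UPDATE_RETURNING; a plain UPDATE would yield SPI_OK_UPDATE.
	 */
	if (spiStatus != SPI_OK_UPDATE_RETURNING)
	{
		elog(ERROR, "health update failed with SPI status %d", spiStatus);
	}

	pfree(query.data);
}
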
@DimCitus DimCitus merged commit d945ca1 into master May 20, 2021
@DimCitus DimCitus deleted the fix/monitor-health-check-events branch May 20, 2021 08:44

Development

Successfully merging this pull request may close these issues.

Monitor stdout doesn't show events about nodes becoming unhealthy
Background worker doesn't know about nodenames

3 participants