standby node stuck in catchingup state when replication slot lost on primary node #1031

rhicks0614 · 2024-04-03T18:31:58Z


Hi,

I'm trying out pg_auto_failover version 2.1.2, using two nodes (postgresql-14).

I'm additionally setting max_slot_wal_keep_size, due to disk space limitations.

After stopping the standby node, the primary node went to "wait_primary" as expected.

Eventually the replication slot's wal_status goes to "lost", because max_slot_wal_keep_size is set:

postgres=# select * from pg_replication_slots;
        slot_name         | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn | wal_status | safe_wal_size | two_phase
--------------------------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------+------------+---------------+-----------
 pgautofailover_standby_2 |        | physical  |        |          | f         | f      |            |  749 |              |             |                     | lost       |               | f
(1 row)

Now, after restarting the standby, I observe that it is stuck in the "catchingup" state according to "pg_autoctl show state":

                    Name |  Node |                                Host:Port |        TLI: LSN |   Connection |      Reported State |      Assigned State
-------------------------+-------+------------------------------------------+-----------------+--------------+---------------------+--------------------
appliance_apm00232407584 |     4 | fded:88e7:d92c:0:201:4427:b7b2:9093:5435 |   1: 1/3B521D68 |   read-write |        wait_primary |        wait_primary
appliance_apm00232407729 |     5 | fded:88e7:d92c:0:201:4493:23d1:6c9d:5435 |   1: 0/9E000000 |    read-only |          catchingup |          catchingup

In the PostgreSQL logs, the standby keeps trying to resume streaming, even though the WAL segment it needs has already been removed on the primary (the slot is "lost"):

...
2024-04-03 17:46:54.633 UTC [5128] LOG:  started streaming WAL from primary at 0/9E000000 on timeline 1
2024-04-03 17:46:54.633 UTC [5128] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 00000001000000000000009E has already been removed
2024-04-03 17:46:59.637 UTC [5135] LOG:  started streaming WAL from primary at 0/9E000000 on timeline 1
2024-04-03 17:46:59.637 UTC [5135] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 00000001000000000000009E has already been removed
2024-04-03 17:47:04.641 UTC [5142] LOG:  started streaming WAL from primary at 0/9E000000 on timeline 1
2024-04-03 17:47:04.641 UTC [5142] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 00000001000000000000009E has already been removed
...
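
To make the loop above easier to spot, here is a minimal Python sketch (not part of pg_auto_failover) that maps a timeline and LSN to the WAL segment file name and counts how often each segment is reported as removed. The function names are hypothetical, and it assumes the default 16 MB wal_segment_size:

```python
import re
from collections import Counter

# Assumption: the cluster uses the default 16 MB wal_segment_size.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024

def wal_segment_name(timeline, lsn, segment_size=WAL_SEGMENT_SIZE):
    """Map a timeline and an LSN such as '0/9E000000' to its WAL segment file name."""
    high, low = (int(part, 16) for part in lsn.split("/"))
    segno = ((high << 32) | low) // segment_size
    segments_per_xlogid = 0x100000000 // segment_size
    return "{:08X}{:08X}{:08X}".format(
        timeline, segno // segments_per_xlogid, segno % segments_per_xlogid
    )

# Matches the FATAL message the standby logs when the primary no longer has the segment.
REMOVED_RE = re.compile(
    r"requested WAL segment ([0-9A-F]{24}) has already been removed"
)

def removed_segment_counts(log_lines):
    """Count how many times each WAL segment is reported as already removed."""
    counts = Counter()
    for line in log_lines:
        match = REMOVED_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

For the log above, wal_segment_name(1, "0/9E000000") gives 00000001000000000000009E, the same segment named in the FATAL lines, so the standby is repeatedly asking for a segment the primary can no longer supply.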

Is there some way to make the standby give up on streaming and fall back to pg_basebackup to recover?

Thanks!
