
Setting replication password on standby node prior to running a standby keeper causes PGCONNECT_TIMEOUT wait before server comes up #837

Closed · rheaton opened this issue Nov 24, 2021 · 1 comment · Fixed by #839
Labels: bug, user experience

@rheaton (Contributor) commented Nov 24, 2021

Observed on pg10 and pg11 (may be present in other versions):

This issue is problematic when PGCONNECT_TIMEOUT is large, because the keeper attempts a CHECKPOINT for that entire timeout before restarting Postgres. The server therefore takes PGCONNECT_TIMEOUT to come up, which is surprising.

Steps to reproduce:

  1. Create a monitor and two nodes using password authentication.
  2. Change the replication password.
  3. Use pg_autoctl config set to update the replication password on the standby.
  4. Run the standby and observe that it waits for PGCONNECT_TIMEOUT before actually starting up.

We spent some time looking around the code and noticed that the CHECKPOINT call in keeper.c was added in a (to us) seemingly unrelated commit, and we aren't entirely clear why we do it when we haven't made a successful connection yet.
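
To illustrate, the sequence we observe boils down to something like the following self-contained sketch (the helpers here are stand-in stubs, not the actual keeper.c code; only the PQping retry behaviour is taken from the logs below):

```c
/*
 * Self-contained sketch of the sequence we observe (stand-in stubs, not
 * the actual keeper.c code): a CHECKPOINT is attempted unconditionally
 * before the restart, and since Postgres is not running yet, the libpq
 * connection behind it keeps retrying until PGCONNECT_TIMEOUT expires.
 */
#include <stdbool.h>
#include <stdio.h>

/* stand-in: with Postgres stopped, this is where the keeper spends up
 * to PGCONNECT_TIMEOUT retrying PQping()/connection attempts */
static bool try_checkpoint_over_libpq(void) { return false; }

/* stand-in for restarting Postgres under the keeper */
static bool restart_postgres(void)
{
	printf("Restarting Postgres\n");
	return true;
}

int main(void)
{
	/* current behaviour: CHECKPOINT first, whether or not Postgres is up */
	if (!try_checkpoint_over_libpq())
	{
		fprintf(stderr, "WARN: Failed to CHECKPOINT before restart\n");
	}

	/* only reached once the whole timeout has been burned */
	return restart_postgres() ? 0 : 1;
}
```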

Logs:

00:08:55 482 INFO  keeper.c:974 Replication settings at "/tmp/auth/node2/recovery.conf" have changed, restarting Postgres
00:08:55 482 DEBUG pgsql.c:506 Connecting to [local] "postgres://@localhost:5432/postgres?"
00:08:55 482 WARN  pgsql.c:596 Failed to connect to "postgres://@localhost:5432/postgres?", retrying until the server is ready
00:08:55 482 DEBUG pgsql.c:643 PQping(postgres://@localhost:5432/postgres?): slept 3 ms on attempt 2
00:08:55 482 DEBUG pgsql.c:643 PQping(postgres://@localhost:5432/postgres?): slept 5 ms on attempt 3
00:08:55 482 DEBUG pgsql.c:643 PQping(postgres://@localhost:5432/postgres?): slept 8 ms on attempt 4
00:08:55 482 DEBUG pgsql.c:643 PQping(postgres://@localhost:5432/postgres?): slept 13 ms on attempt 5
00:08:55 482 DEBUG pgsql.c:643 PQping(postgres://@localhost:5432/postgres?): slept 27 ms on attempt 6
00:08:55 482 DEBUG pgsql.c:643 PQping(postgres://@localhost:5432/postgres?): slept 17 ms on attempt 7
00:08:55 482 DEBUG pgsql.c:643 PQping(postgres://@localhost:5432/postgres?): slept 21 ms on attempt 8
00:08:55 482 DEBUG pgsql.c:643 PQping(postgres://@localhost:5432/postgres?): slept 30 ms on attempt 9
00:08:55 482 DEBUG pgsql.c:643 PQping(postgres://@localhost:5432/postgres?): slept 37 ms on attempt 10
00:08:55 482 DEBUG pgsql.c:643 PQping(postgres://@localhost:5432/postgres?): slept 78 ms on attempt 11
00:08:55 482 DEBUG pgsql.c:643 PQping(postgres://@localhost:5432/postgres?): slept 188 ms on attempt 12
00:08:56 482 DEBUG pgsql.c:643 PQping(postgres://@localhost:5432/postgres?): slept 348 ms on attempt 13
00:08:57 482 DEBUG pgsql.c:643 PQping(postgres://@localhost:5432/postgres?): slept 852 ms on attempt 14
00:08:57 482 DEBUG pgsql.c:643 PQping(postgres://@localhost:5432/postgres?): slept 761 ms on attempt 15
00:08:57 482 ERROR pgsql.c:468 Connection to database failed: connection to server at "localhost" (::1), port 5432 failed: Connection refused
00:08:57 482 ERROR pgsql.c:472 	Is the server running on that host and accepting TCP/IP connections?
00:08:57 482 ERROR pgsql.c:472 connection to server at "localhost" (127.0.0.1), port 5432 failed: Connection refused
00:08:57 482 ERROR pgsql.c:472 	Is the server running on that host and accepting TCP/IP connections?
00:08:57 482 DEBUG pgsql.c:429 Disconnecting from [local] "postgres://@localhost:5432/postgres?"
00:08:57 482 ERROR pgsql.c:625 Failed to connect to "postgres://@localhost:5432/postgres?" after 15 attempts in 2411 ms, pg_autoctl stops retrying now
00:08:57 482 WARN  keeper.c:979 Failed to CHECKPOINT before restart, see above for details
00:08:57 482 INFO  keeper.c:681 Restarting Postgres at "/tmp/auth/node2"
00:08:57 482 DEBUG state.c:941 Writing keeper postgres expected state file at "/tmp/pg_autoctl/tmp/auth/node2/pg_autoctl.pg"
00:08:57 482 DEBUG state.c:943 keeper_postgres_state_create: version = 1
00:08:57 482 DEBUG state.c:945 keeper_postgres_state_create: ExpectedPostgresStatus = Postgres should be stopped
00:08:57 482 DEBUG state.c:941 Writing keeper postgres expected state file at "/tmp/pg_autoctl/tmp/auth/node2/pg_autoctl.pg"
00:08:57 482 DEBUG state.c:943 keeper_postgres_state_create: version = 1
00:08:57 482 DEBUG state.c:945 keeper_postgres_state_create: ExpectedPostgresStatus = Postgres should be running
00:08:57 482 DEBUG pgctl.c:1875 /usr/lib/postgresql/10/bin/pg_ctl status -D /tmp/auth/node2 [3]
00:08:57 481 DEBUG service_postgres.c:85 pg_autoctl started postgres in subprocess 529
00:08:57 529 INFO  pgctl.c:1573  /usr/lib/postgresql/10/bin/postgres -D /tmp/auth/node2 -p 5432 -h * -k ''
00:08:58 482 DEBUG pgsetup.c:1058 pg_setup_wait_until_is_ready(): postgres is running, pid 529 (was 0), after 0s and 1 attempt(s)
00:08:58 482 DEBUG pgsetup.c:1146 Postgres is now serving PGDATA "/tmp/auth/node2" on port 5432 with pid 529
00:08:58 482 DEBUG primary_standby.c:303 local_postgres_wait_until_ready: Postgres is running with pid 529
00:08:58 482 WARN  keeper.c:716 PostgreSQL was not running, restarted with pid 529
00:08:58 482 DEBUG state.c:49 Reading current state from "/home/docker/.local/share/pg_autoctl/tmp/auth/node2/pg_autoctl.state"
00:08:58 482 INFO  service_keeper.c:401 pg_autoctl service is running, current state is "catchingup"
00:08:58 482 DEBUG keeper.c:436 Update local PostgreSQL state
00:08:58 482 DEBUG pgsql.c:506 Connecting to [local] "postgres://@localhost:5432/postgres?"
00:08:58 482 DEBUG pgsql.c:1000 select pg_is_in_recovery(), coalesce(rep.sync_state, '') as sync_state, case when pg_is_in_recovery() then coalesce(pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn()) else pg_current_wal_flush_lsn() end as current_lsn, pg_control_version, catalog_version_no, system_identifier, case when pg_is_in_recovery() then (select received_tli from pg_stat_wal_receiver) else (select timeline_id from pg_control_checkpoint())  end as timeline_id  from (values(1)) as dummy full outer join (select pg_control_version, catalog_version_no, system_identifier     from pg_control_system() ) as control on true full outer join (   select sync_state     from pg_replication_slots slot     join pg_stat_replication rep       on rep.pid = slot.active_pid   where slot_name ~ '^pgautofailover_standby_'       or slot_name = 'pgautofailover_standby' order by case sync_state          when 'quorum' then 4          when 'sync' then 3          when 'potential' then 2          when 'async' then 1          else 0 end     desc limit 1 ) as rep on true;
00:08:58 482 DEBUG pgsql.c:506 Connecting to [monitor] "postgres://autoctl_node@172.27.1.2:5432/pg_auto_failover?password=****&sslmode=prefer"
00:08:58 481 DEBUG pgsetup.c:1058 pg_setup_wait_until_is_ready(): postgres is running, pid 529 (was 0), after 0s and 1 attempt(s)
00:08:58 481 INFO  pgsetup.c:1146 Postgres is now serving PGDATA "/tmp/auth/node2" on port 5432 with pid 529
00:08:58 481 DEBUG pgctl.c:1612 Postgres logs from "/tmp/auth/node2/startup.log":
00:08:58 481 DEBUG pgctl.c:1616 2021-11-24 00:08:57.955 UTC [529] LOG:  listening on IPv4 address "0.0.0.0", port 5432

Thanks!
Rachel & @jchampio

@DimCitus added the "bug" and "user experience" labels Nov 24, 2021
@DimCitus (Collaborator) commented:

Hi @rheaton and @jchampio; we added the ability to restart Postgres when reacting to replication configuration changes in the PR that introduced this call to CHECKPOINT. We call CHECKPOINT before the restart to reduce the time window during which it's impossible to connect to Postgres. I suppose in spirit it's the same as in the fsm_checkpoint_and_stop_postgres function, see src/bin/pg_autoctl/fsm_transition.c#L812-L822.

Now it seems to me that a good way to fix the user experience in the case you're reporting would be to first check whether Postgres is running. When it is running, do as we do now. When it's not running, we don't need to CHECKPOINT and we don't need to restart; we only need to start the service with the usual ensure_postgres_service_is_running call.
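
In pseudo-C the proposed guard could look roughly like the following sketch; apart from ensure_postgres_service_is_running, which is named above, every function here is a placeholder stub rather than the real pg_auto_failover code:

```c
/*
 * Self-contained sketch of the proposed guard (placeholder stubs, not
 * the real keeper.c code): CHECKPOINT and restart only when Postgres is
 * already running, otherwise just start the service. Only the name
 * ensure_postgres_service_is_running comes from the comment above.
 */
#include <stdbool.h>
#include <stdio.h>

static bool postgres_is_running(void) { return false; }       /* stub */
static bool checkpoint_before_restart(void) { return true; }  /* stub */
static bool restart_postgres(void) { return true; }           /* stub */

static bool ensure_postgres_service_is_running(void)          /* stub */
{
	printf("Starting Postgres service\n");
	return true;
}

static bool
apply_replication_settings_change(void)
{
	if (postgres_is_running())
	{
		/* same behaviour as today: shorten the window during which
		 * clients cannot connect, then restart */
		if (!checkpoint_before_restart())
		{
			fprintf(stderr, "WARN: Failed to CHECKPOINT before restart\n");
		}
		return restart_postgres();
	}

	/* Postgres is down: no CHECKPOINT, no restart, just start it */
	return ensure_postgres_service_is_running();
}

int main(void)
{
	return apply_replication_settings_change() ? 0 : 1;
}
```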

rheaton pushed a commit to rheaton/pg_auto_failover that referenced this issue Nov 24, 2021
When updating replication settings, no need to attempt a checkpoint and
restart when postgres keeper is not running. Just start the service.

Fixes hapostgres#837
DimCitus pushed a commit that referenced this issue Nov 25, 2021
* No need to checkpoint and restart if pg is not running

When updating replication settings, no need to attempt a checkpoint and
restart when postgres keeper is not running. Just start the service.

Fixes #837

* Small test fix: re.match only matches at the start of the string (effectively the first line)

re.search will not give us false positives in this test