## Summary
When a PostgreSQL replica backend is unreachable (TCP connection refused), pgdog's health check mechanism does not detect the failure or ban the replica from the load balancer. Clients continue to be routed to the dead backend, experiencing timeouts before eventually falling back to the primary.
## Environment
- pgdog: latest (pulled ~2026-02-20)
- Deployment: Docker Swarm via inline configs
- Backend: 2 PostgreSQL nodes (1 primary, 1 replica); the replica is down (port 5432 not listening, connection refused)
## Configuration

```toml
[general]
healthcheck_interval = 15_000
idle_healthcheck_interval = 5_000
idle_healthcheck_delay = 5_000
healthcheck_timeout = 2_000
ban_timeout = 300_000
connect_timeout = 1_000
connect_attempts = 1
checkout_timeout = 5_000
read_write_split = "include_primary_if_replica_banned"
```

## Expected Behavior
When a replica is unreachable (connection refused on port 5432):
- Health checks should detect the failure (either via idle health checks creating ephemeral connections, or via failed client connection attempts)
- The error counter for that pool should increment
- The replica should be banned from the load balancer
- With `read_write_split = "include_primary_if_replica_banned"`, reads should route to the primary
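The expected flow above can be sketched as a small model. All names here (`PoolState`, `health_check`, `max_errors`) are hypothetical and are not pgdog's internals; the point is only that a refused TCP connect during an ephemeral health check should feed the error counter and flip the ban flag.

```python
import socket

class PoolState:
    """Hypothetical per-pool state mirroring the SHOW POOLS columns."""

    def __init__(self, host, port, max_errors=1):
        self.host, self.port = host, port
        self.errors = 0       # "errors" column
        self.banned = False   # "banned" column
        self.max_errors = max_errors

    def health_check(self, timeout=2.0):
        """Ephemeral-connection health check: a refused TCP connect
        counts as an error and, past the threshold, bans the pool."""
        try:
            with socket.create_connection((self.host, self.port), timeout=timeout):
                pass
            self.errors = 0  # healthy: reset the counter
        except OSError:
            # ConnectionRefusedError is an OSError subclass, so a dead
            # backend (nothing listening) lands here, as a timeout would.
            self.errors += 1
            if self.errors >= self.max_errors:
                self.banned = True
```

With a dead backend, a single `health_check()` call would leave `errors == 1` and `banned == True`, which is the behavior this report expects and does not observe.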
## Actual Behavior
- `SHOW POOLS` shows `errors = 0` and `banned = f` for the dead replica, indefinitely
- `SHOW SERVERS` shows zero server connections to the dead replica (expected)
- `SHOW REPLICATION` shows the replica with empty LSN values
- Read queries still get routed to the dead replica's pool
- Clients experience ~6-10 s delays (`connect_timeout` + `checkout_timeout`) before pgdog falls back to the primary
- The error counter never increments, so the replica is never banned
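The observed latency is consistent with the configured timeouts, assuming a failed connect attempt and the subsequent checkout wait simply add up (an assumption about how the delays compose, not a confirmed code path):

```python
# Worst-case extra client delay when the dead replica is tried first,
# assuming one failed connect (connect_timeout) followed by waiting
# out the pool checkout (checkout_timeout). Values from the config above.
connect_timeout_ms = 1_000
checkout_timeout_ms = 5_000

worst_case_ms = connect_timeout_ms + checkout_timeout_ms
print(worst_case_ms)  # → 6000, the low end of the observed ~6-10 s
```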
## Reproduction Steps
1. Configure pgdog with 2 backends (1 primary, 1 replica) using `role = "auto"`
2. Stop PostgreSQL on the replica (so port 5432 returns connection refused)
3. Wait for the health check intervals to pass
4. Run `SHOW POOLS` and observe `errors = 0`, `banned = f` for the replica
5. Run a read query and observe it takes ~6-10 s instead of <1 s
6. Run `SHOW POOLS` again; errors are still 0
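When reproducing, it is worth confirming the precondition from step 2: that the replica port fails with a clean connection refusal rather than a timeout or a routing black hole, since the two may take different paths inside pgdog. A small check (the function name is made up for this sketch):

```python
import socket

def replica_refuses(host="127.0.0.1", port=5432):
    """Return True only if a TCP connect to the replica fails fast
    with ECONNREFUSED, i.e. the port is reachable but not listening."""
    try:
        with socket.create_connection((host, port), timeout=1.0):
            return False  # something answered: the replica is still up
    except ConnectionRefusedError:
        return True
    except OSError:
        return False  # timed out / unreachable, not a clean refusal
```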
## Impact
This defeats the purpose of health checks and banning for the most common failure mode (backend process stopped/crashed). The `include_primary_if_replica_banned` setting works correctly when banning is triggered, but the ban never fires for connection-refused failures.
## Workaround
Currently the only workaround is to manually remove the dead replica from the pgdog config and restart, which eliminates the automatic failover benefit.
## Notes
- The documentation states: "If there are no idle connections available, PgDog will create an ephemeral connection to perform the healthcheck." This ephemeral connection should fail with connection-refused, but it doesn't appear to trigger a ban.
- This may be related to how pgdog handles `ECONNREFUSED` vs. query errors on established connections; it seems only the latter triggers banning.
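To make the suspected gap concrete, here is a toy model, purely illustrative: `Pool`, `checkout`, and `run_query` are invented names, not pgdog code. A query error on an already-established connection reaches the error-accounting path, while a refused connect is swallowed before it ever gets there, which would explain `errors = 0` persisting for a dead backend.

```python
class Pool:
    """Toy model of the suspected accounting gap."""

    def __init__(self):
        self.errors = 0
        self.banned = False

    def record_error(self):
        self.errors += 1
        self.banned = True

    def checkout(self, connect):
        """Hand out a connection; connect() models the TCP dial."""
        try:
            return connect()
        except ConnectionRefusedError:
            # Suspected behavior: the connect failure is absorbed here
            # without record_error(), so SHOW POOLS keeps errors = 0.
            return None

    def run_query(self, conn, query):
        """Run a query on a live connection; errors here do count."""
        try:
            return conn(query)
        except RuntimeError:
            self.record_error()
            raise
```

In this model only the `run_query` failure path bans the pool; if pgdog's connect-error path similarly bypasses its error counter, wiring `ECONNREFUSED` into the same accounting would be the fix.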