postgres databroker backend doesn't handle multiple nodes/failover #3634

alexrudd2 · 2022-09-28T18:24:41Z

What happened?

Pomerium's databroker failed to switch to a new primary PSQL host after the existing primary was taken offline.

What did you expect to happen?

Per pomerium docs, the connection string follows libpq. Per libpq docs:

It is possible to specify multiple host components, each with an optional port component, in a single URI. A URI of the form postgresql://host1:port1,host2:port2,host3:port3/ is equivalent to a connection string of the form host=host1,host2,host3 port=port1,port2,port3. As further described below, each host will be tried in turn until a connection is successfully established.

I configured the databroker with multiple hosts, and then took down the first host.
The databroker should have tried the next host (pg-1) in the PSQL connection string after pg-0 was unreachable.
It should have connected, but noticed pg-1 is missing the read-write session attribute.
It should then have tried the next host(pg-2), connect, confirmed the read-write session attribute, and continued functioning on the new primary.

How'd it happen?

I setup pomerium to use PSQL + repmgr as the databroker backend, with 3 nodes. pg-0, pg-1, pg-2 .
I then killed pg-0.
repmgr correctly initiated a failover to a standby node pg-2, as verified by pgadmin.
.
Pomerium just errored out.

What's your environment like?

Pomerium version (retrieve with pomerium --version): 0.19.1
Server Operating System/Architecture/Cloud:

uname -a
Linux water 5.15.0-47-generic #51-Ubuntu SMP Thu Aug 11 07:51:15 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Docker Swarm (1 node): 20.10.18

version: '3.2'
services:
  pomerium:
    image: pomerium/pomerium:latest
    volumes:
      - ._wildcard.localhost.pomerium.io.pem:/pomerium/cert.pem:ro
      - ._wildcard.localhost.pomerium.io-key.pem:/pomerium/privkey.pem:ro

      ## Mount your config file : https://www.pomerium.com/docs/reference/
      - .config.yaml:/pomerium/config.yaml:ro
    ports:
      - 443:443
    secrets:
      - POMERIUM_SHARED_SECRET
      - POMERIUM_IDP_CLIENT_SECRET
      - POMERIUM_COOKIE_SECRET

  verify:
    image: pomerium/verify:latest
    expose:
      - 8000

  pg-0:
    image: docker.io/bitnami/postgresql-repmgr:14
    volumes:
      - pg_0_data:/bitnami/postgresql
    environment:
      - POSTGRESQL_POSTGRES_PASSWORD=adminpassword
      - POSTGRESQL_USERNAME=customuser
      - POSTGRESQL_PASSWORD=custompassword
      - POSTGRESQL_DATABASE=pomerium
      - REPMGR_PASSWORD=repmgrpassword
      - REPMGR_PRIMARY_HOST=pg-0
      - REPMGR_PRIMARY_PORT=5432
      - REPMGR_PARTNER_NODES=pg-0,pg-1,pg-2:5432
      - REPMGR_NODE_NAME=pg-0
      - REPMGR_NODE_NETWORK_NAME=pg-0
      - REPMGR_PORT_NUMBER=5432
  pg-1:
    image: docker.io/bitnami/postgresql-repmgr:14
    ports:
      - 5432
    volumes:
      - pg_1_data:/bitnami/postgresql
    environment:
      - POSTGRESQL_POSTGRES_PASSWORD=adminpassword
      - POSTGRESQL_USERNAME=customuser
      - POSTGRESQL_PASSWORD=custompassword
      - POSTGRESQL_DATABASE=pomerium
      - REPMGR_PASSWORD=repmgrpassword
      - REPMGR_PRIMARY_HOST=pg-0
      - REPMGR_PRIMARY_PORT=5432
      - REPMGR_PARTNER_NODES=pg-0,pg-1,pg-2:5432
      - REPMGR_NODE_NAME=pg-1
      - REPMGR_NODE_NETWORK_NAME=pg-1
      - REPMGR_PORT_NUMBER=5432
  pg-2:
    image: docker.io/bitnami/postgresql-repmgr:14
    ports:
      - 5432
    volumes:
      - pg_2_data:/bitnami/postgresql
    environment:
      - POSTGRESQL_POSTGRES_PASSWORD=adminpassword
      - POSTGRESQL_USERNAME=customuser
      - POSTGRESQL_PASSWORD=custompassword
      - POSTGRESQL_DATABASE=pomerium
      - REPMGR_PASSWORD=repmgrpassword
      - REPMGR_PRIMARY_HOST=pg-0
      - REPMGR_PRIMARY_PORT=5432
      - REPMGR_PARTNER_NODES=pg-0,pg-1,pg-2:5432
      - REPMGR_NODE_NAME=pg-2
      - REPMGR_NODE_NETWORK_NAME=pg-2
      - REPMGR_PORT_NUMBER=5432
volumes:
  pg_0_data:
    driver: local
  pg_1_data:
    driver: local
  pg_2_data:
    driver: local
secrets:
  POMERIUM_SHARED_SECRET:
    external: true
  POMERIUM_COOKIE_SECRET:
    external: true
  POMERIUM_IDP_CLIENT_SECRET:
    external: true

What's your config.yaml?

databroker_storage_type: postgres
databroker_storage_connection_string: postgres://customuser:custompassword@pg-0:5432,pg-1:5432,pg-2:5432/pomerium?sslmode=disable&target_session_attrs=read-write

What did you see in the logs?

{"level":"error","error":"failed to connect to `host=pg-0 user=customuser database=pomerium`: hostname resolving error (lookup pg-0 on 127.0.0.11:53: no such host)","time":"2022-09-28T17:54:32Z","message":"storage/postgres"}
ad infinitum

The text was updated successfully, but these errors were encountered:

alexrudd2 · 2022-09-28T18:55:25Z

@EdgardGomez @juli4nb4dillo @JamesJeffryes

desimone · 2022-09-29T17:35:08Z

Hey @alexrudd2 -- though this does not directly address your question (we are investigating), I notice your error looks related to DNS.

lookup pg-0 on 127.0.0.11:53: no such host

Is it possible DNS could be the underlying culprit here?

alexrudd2 · 2022-09-29T19:03:28Z

Hey @alexrudd2 -- though this does not directly address your question (we are investigating), I notice your error looks related to DNS.
lookup pg-0 on 127.0.0.11:53: no such host
Is it possible DNS could be the underlying culprit here?

Hello @desimone.
Haha, although I'm living dangerously disobeying the rule It's always DNS... it's not DNS.

The "DNS" here is Docker's Swarm overlay networking, which resolves service names - in this case pg-0 to whatever physical host they're running on. It resolves fine when the service is up, but fails when down. Here I intentionally took it down, but pg-1 and pg-2 still resolve fine.

calebdoxsey · 2022-09-29T20:01:05Z

Our underlying postgres driver should support multiple hosts. There may be a bug in how it's handling this, or maybe we're setting it up wrong. We will investigate.

calebdoxsey · 2022-09-29T20:34:19Z

The postgres driver we use treats DNS resolution errors differently from connection errors. This cause an immediate error rather than falling back to one of the other hosts. This behavior is understandable, so I'm not sure it's a bug, but I think we can configure the driver to work differently regardless. I will see if this fixes the problem.

alexrudd2 · 2022-10-04T18:42:58Z

I deployed git-3b2cc672 today. Things looked good - all 3 PSQL nodes came online (using basically the config I posted), pomerium connected to the PSQL backend, and it served traffic. I successfully took down the primary PSQL node and the databroker survived, switching over to the new primary.

## WORKING
# Use HA PSQL databroker backend
databroker_storage_type: postgres
databroker_storage_connection_string: postgres://pomerium:pomerium@pg-0,pg-1,pg-2/pomerium?sslmode=disable&target_session_attrs=read-write

===

However, a word of caution! target_session_attrs=read-write is absolutely critical. Without it libpq correctly switches to a second node, but has a chance to end up on a read-only replica. When I tested this, the constant write attempts eventually took down repmgr and even the entire PSQL container.

{"level":"info","syncer_id":"databroker","syncer_type":"type.googleapis.com/pomerium.config.Config","time":"2022-10-04T13:08:04-05:00","message":"initial sync"}
{"level":"info","type":"type.googleapis.com/pomerium.config.Config","time":"2022-10-04T13:08:04-05:00","message":"sync latest"}
{"level":"error","syncer_id":"databroker","syncer_type":"type.googleapis.com/pomerium.config.Config","error":"rpc error: code = Unknown desc = ERROR: cannot execute CREATE SCHEMA in a read-only transaction (SQLSTATE

## BROKEN
# Use HA PSQL databroker backend
databroker_storage_type: postgres
databroker_storage_connection_string: postgres://customuser:custompassword@pg-0,pg-1,pg-2/pomerium?sslmode=disable

I'm not sure how the databroker works under the hood; perhaps on large installs it can make use of read-only replicas for load balancing? However, attempting to write on a read-only connection makes no sense to me. Arguably this behavior should be changed, or at the least the docs can be clarified in the docs to stress using target_session_attrs=read-write when using HA.

alexrudd2 · 2022-10-06T17:57:18Z

Had git-3b2cc672 deployed during a massive series of network disruptions last night, and user sessions came back up beautifully. Many thanks!

desimone · 2022-10-07T03:01:41Z

Added todo about better docs

desimone added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Sep 28, 2022

calebdoxsey mentioned this issue Sep 30, 2022

postgres: return an empty list of addresses on dns errors #3637

Merged

6 tasks

calebdoxsey closed this as completed in #3637 Sep 30, 2022

desimone mentioned this issue Oct 7, 2022

docs: suggest target_session_attrs=read-write when using HA and psql pomerium/documentation#126

Closed

alexrudd2 mentioned this issue Jan 12, 2023

Enterprise does not allow full postgres connection strings for database_url #3884

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

postgres databroker backend doesn't handle multiple nodes/failover #3634

postgres databroker backend doesn't handle multiple nodes/failover #3634

alexrudd2 commented Sep 28, 2022

alexrudd2 commented Sep 28, 2022

desimone commented Sep 29, 2022

alexrudd2 commented Sep 29, 2022

calebdoxsey commented Sep 29, 2022

calebdoxsey commented Sep 29, 2022 •

edited

alexrudd2 commented Oct 4, 2022

alexrudd2 commented Oct 6, 2022

desimone commented Oct 7, 2022

postgres databroker backend doesn't handle multiple nodes/failover #3634

postgres databroker backend doesn't handle multiple nodes/failover #3634

Comments

alexrudd2 commented Sep 28, 2022

What happened?

What did you expect to happen?

How'd it happen?

What's your environment like?

What's your config.yaml?

What did you see in the logs?

alexrudd2 commented Sep 28, 2022

desimone commented Sep 29, 2022

alexrudd2 commented Sep 29, 2022

calebdoxsey commented Sep 29, 2022

calebdoxsey commented Sep 29, 2022 • edited

alexrudd2 commented Oct 4, 2022

alexrudd2 commented Oct 6, 2022

desimone commented Oct 7, 2022

calebdoxsey commented Sep 29, 2022 •

edited