Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

postgres databroker backend doesn't handle multiple nodes/failover #3634

Closed
alexrudd2 opened this issue Sep 28, 2022 · 8 comments · Fixed by #3637
Closed

postgres databroker backend doesn't handle multiple nodes/failover #3634

alexrudd2 opened this issue Sep 28, 2022 · 8 comments · Fixed by #3637
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.

Comments

@alexrudd2
Copy link
Contributor

What happened?

Pomerium's databroker failed to switch to a new primary PSQL host after the existing primary was taken offline.

What did you expect to happen?

Per pomerium docs, the connection string follows libpq. Per libpq docs:

It is possible to specify multiple host components, each with an optional port component, in a single URI. A URI of the form postgresql://host1:port1,host2:port2,host3:port3/ is equivalent to a connection string of the form host=host1,host2,host3 port=port1,port2,port3. As further described below, each host will be tried in turn until a connection is successfully established.

I configured the databroker with multiple hosts, and then took down the first host.
The databroker should have tried the next host (pg-1) in the PSQL connection string after pg-0 was unreachable.
It should have connected, but noticed pg-1 is missing the read-write session attribute.
It should then have tried the next host(pg-2), connect, confirmed the read-write session attribute, and continued functioning on the new primary.

How'd it happen?

I setup pomerium to use PSQL + repmgr as the databroker backend, with 3 nodes. pg-0, pg-1, pg-2 .
I then killed pg-0.
repmgr correctly initiated a failover to a standby node pg-2, as verified by pgadmin.
Screen Shot 2022-09-28 at 12 53 26 PM.
Pomerium just errored out.

What's your environment like?

  • Pomerium version (retrieve with pomerium --version): 0.19.1
  • Server Operating System/Architecture/Cloud:
uname -a
Linux water 5.15.0-47-generic #51-Ubuntu SMP Thu Aug 11 07:51:15 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Docker Swarm (1 node): 20.10.18

version: '3.2'
services:
  pomerium:
    image: pomerium/pomerium:latest
    volumes:
      - ._wildcard.localhost.pomerium.io.pem:/pomerium/cert.pem:ro
      - ._wildcard.localhost.pomerium.io-key.pem:/pomerium/privkey.pem:ro

      ## Mount your config file : https://www.pomerium.com/docs/reference/
      - .config.yaml:/pomerium/config.yaml:ro
    ports:
      - 443:443
    secrets:
      - POMERIUM_SHARED_SECRET
      - POMERIUM_IDP_CLIENT_SECRET
      - POMERIUM_COOKIE_SECRET

  verify:
    image: pomerium/verify:latest
    expose:
      - 8000

  pg-0:
    image: docker.io/bitnami/postgresql-repmgr:14
    volumes:
      - pg_0_data:/bitnami/postgresql
    environment:
      - POSTGRESQL_POSTGRES_PASSWORD=adminpassword
      - POSTGRESQL_USERNAME=customuser
      - POSTGRESQL_PASSWORD=custompassword
      - POSTGRESQL_DATABASE=pomerium
      - REPMGR_PASSWORD=repmgrpassword
      - REPMGR_PRIMARY_HOST=pg-0
      - REPMGR_PRIMARY_PORT=5432
      - REPMGR_PARTNER_NODES=pg-0,pg-1,pg-2:5432
      - REPMGR_NODE_NAME=pg-0
      - REPMGR_NODE_NETWORK_NAME=pg-0
      - REPMGR_PORT_NUMBER=5432
  pg-1:
    image: docker.io/bitnami/postgresql-repmgr:14
    ports:
      - 5432
    volumes:
      - pg_1_data:/bitnami/postgresql
    environment:
      - POSTGRESQL_POSTGRES_PASSWORD=adminpassword
      - POSTGRESQL_USERNAME=customuser
      - POSTGRESQL_PASSWORD=custompassword
      - POSTGRESQL_DATABASE=pomerium
      - REPMGR_PASSWORD=repmgrpassword
      - REPMGR_PRIMARY_HOST=pg-0
      - REPMGR_PRIMARY_PORT=5432
      - REPMGR_PARTNER_NODES=pg-0,pg-1,pg-2:5432
      - REPMGR_NODE_NAME=pg-1
      - REPMGR_NODE_NETWORK_NAME=pg-1
      - REPMGR_PORT_NUMBER=5432
  pg-2:
    image: docker.io/bitnami/postgresql-repmgr:14
    ports:
      - 5432
    volumes:
      - pg_2_data:/bitnami/postgresql
    environment:
      - POSTGRESQL_POSTGRES_PASSWORD=adminpassword
      - POSTGRESQL_USERNAME=customuser
      - POSTGRESQL_PASSWORD=custompassword
      - POSTGRESQL_DATABASE=pomerium
      - REPMGR_PASSWORD=repmgrpassword
      - REPMGR_PRIMARY_HOST=pg-0
      - REPMGR_PRIMARY_PORT=5432
      - REPMGR_PARTNER_NODES=pg-0,pg-1,pg-2:5432
      - REPMGR_NODE_NAME=pg-2
      - REPMGR_NODE_NETWORK_NAME=pg-2
      - REPMGR_PORT_NUMBER=5432
volumes:
  pg_0_data:
    driver: local
  pg_1_data:
    driver: local
  pg_2_data:
    driver: local
secrets:
  POMERIUM_SHARED_SECRET:
    external: true
  POMERIUM_COOKIE_SECRET:
    external: true
  POMERIUM_IDP_CLIENT_SECRET:
    external: true

What's your config.yaml?

databroker_storage_type: postgres
databroker_storage_connection_string: postgres://customuser:custompassword@pg-0:5432,pg-1:5432,pg-2:5432/pomerium?sslmode=disable&target_session_attrs=read-write

What did you see in the logs?

{"level":"error","error":"failed to connect to `host=pg-0 user=customuser database=pomerium`: hostname resolving error (lookup pg-0 on 127.0.0.11:53: no such host)","time":"2022-09-28T17:54:32Z","message":"storage/postgres"}
ad infinitum
@alexrudd2
Copy link
Contributor Author

@desimone desimone added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Sep 28, 2022
@desimone
Copy link
Contributor

Hey @alexrudd2 -- though this does not directly address your question (we are investigating), I notice your error looks related to DNS.

lookup pg-0 on 127.0.0.11:53: no such host

Is it possible DNS could be the underlying culprit here?

@alexrudd2
Copy link
Contributor Author

Hey @alexrudd2 -- though this does not directly address your question (we are investigating), I notice your error looks related to DNS.

lookup pg-0 on 127.0.0.11:53: no such host

Is it possible DNS could be the underlying culprit here?

Hello @desimone.
Haha, although I'm living dangerously disobeying the rule It's always DNS... it's not DNS.

The "DNS" here is Docker's Swarm overlay networking, which resolves service names - in this case pg-0 to whatever physical host they're running on. It resolves fine when the service is up, but fails when down. Here I intentionally took it down, but pg-1 and pg-2 still resolve fine.

@calebdoxsey
Copy link
Contributor

Our underlying postgres driver should support multiple hosts. There may be a bug in how it's handling this, or maybe we're setting it up wrong. We will investigate.

@calebdoxsey
Copy link
Contributor

calebdoxsey commented Sep 29, 2022

The postgres driver we use treats DNS resolution errors differently from connection errors. This cause an immediate error rather than falling back to one of the other hosts. This behavior is understandable, so I'm not sure it's a bug, but I think we can configure the driver to work differently regardless. I will see if this fixes the problem.

@alexrudd2
Copy link
Contributor Author

I deployed git-3b2cc672 today. Things looked good - all 3 PSQL nodes came online (using basically the config I posted), pomerium connected to the PSQL backend, and it served traffic. I successfully took down the primary PSQL node and the databroker survived, switching over to the new primary.

## WORKING
# Use HA PSQL databroker backend
databroker_storage_type: postgres
databroker_storage_connection_string: postgres://pomerium:pomerium@pg-0,pg-1,pg-2/pomerium?sslmode=disable&target_session_attrs=read-write

===

However, a word of caution! target_session_attrs=read-write is absolutely critical. Without it libpq correctly switches to a second node, but has a chance to end up on a read-only replica. When I tested this, the constant write attempts eventually took down repmgr and even the entire PSQL container.

{"level":"info","syncer_id":"databroker","syncer_type":"type.googleapis.com/pomerium.config.Config","time":"2022-10-04T13:08:04-05:00","message":"initial sync"}
{"level":"info","type":"type.googleapis.com/pomerium.config.Config","time":"2022-10-04T13:08:04-05:00","message":"sync latest"}
{"level":"error","syncer_id":"databroker","syncer_type":"type.googleapis.com/pomerium.config.Config","error":"rpc error: code = Unknown desc = ERROR: cannot execute CREATE SCHEMA in a read-only transaction (SQLSTATE 
## BROKEN
# Use HA PSQL databroker backend
databroker_storage_type: postgres
databroker_storage_connection_string: postgres://customuser:custompassword@pg-0,pg-1,pg-2/pomerium?sslmode=disable

I'm not sure how the databroker works under the hood; perhaps on large installs it can make use of read-only replicas for load balancing? However, attempting to write on a read-only connection makes no sense to me. Arguably this behavior should be changed, or at the least the docs can be clarified in the docs to stress using target_session_attrs=read-write when using HA.

@alexrudd2
Copy link
Contributor Author

Had git-3b2cc672 deployed during a massive series of network disruptions last night, and user sessions came back up beautifully. Many thanks!

@desimone
Copy link
Contributor

desimone commented Oct 7, 2022

Added todo about better docs

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants