postgres databroker backend doesn't handle multiple nodes/failover #3634
Comments
Hey @alexrudd2 -- though this does not directly address your question (we are investigating), I notice your error looks related to DNS: `lookup pg-0 on 127.0.0.11:53: no such host`. Is it possible DNS could be the underlying culprit here?
Hello @desimone. The "DNS" here is Docker's Swarm overlay networking, which resolves service names - in this case |
Our underlying postgres driver should support multiple hosts. There may be a bug in how it's handling this, or maybe we're setting it up wrong. We will investigate.
The postgres driver we use treats DNS resolution errors differently from connection errors. A resolution error causes an immediate failure rather than a fallback to one of the other hosts. This behavior is understandable, so I'm not sure it's a bug, but I think we can configure the driver to work differently regardless. I will see if this fixes the problem.
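To make the distinction above concrete, here is a minimal Python sketch of a multi-host connect loop. It is not the driver's actual code: `ResolutionError`, `ConnectError`, `connect_first_available`, and `fake_dial` are all hypothetical names invented for illustration, and the network is simulated.

```python
class ResolutionError(Exception):
    """Hypothetical error: the host name could not be resolved."""


class ConnectError(Exception):
    """Hypothetical error: the host resolved but the connection failed."""


def connect_first_available(hosts, dial, fallback_on_resolution_error):
    """Try each host in order and return the first one dial() accepts.

    Connection errors always fall through to the next host. Resolution
    errors only fall through when fallback_on_resolution_error is True;
    otherwise they abort the whole attempt immediately.
    """
    last_error = None
    for host in hosts:
        try:
            dial(host)
            return host
        except ConnectError as err:
            last_error = err
        except ResolutionError as err:
            if not fallback_on_resolution_error:
                raise
            last_error = err
    if last_error is None:
        raise ValueError("no hosts given")
    raise last_error


def fake_dial(host):
    # Simulated network: the name pg-0 no longer resolves; pg-1 and pg-2 are up.
    if host == "pg-0":
        raise ResolutionError("lookup pg-0: no such host")
```

With `fallback_on_resolution_error=False` the loop stops at `pg-0`, which matches the symptom reported in this issue; with `True` it moves on to `pg-1`.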
I deployed git-3b2cc672 today. Things looked good: all 3 PSQL nodes came online (using basically the config I posted), pomerium connected to the PSQL backend, and it served traffic. I successfully took down the primary PSQL node and the databroker survived, switching over to the new primary.

## WORKING
# Use HA PSQL databroker backend
databroker_storage_type: postgres
databroker_storage_connection_string: postgres://pomerium:pomerium@pg-0,pg-1,pg-2/pomerium?sslmode=disable&target_session_attrs=read-write

However, a word of caution!
## BROKEN
# Use HA PSQL databroker backend
databroker_storage_type: postgres
databroker_storage_connection_string: postgres://customuser:custompassword@pg-0,pg-1,pg-2/pomerium?sslmode=disable

I'm not sure how the databroker works under the hood; perhaps on large installs it can make use of read-only replicas for load balancing? However, attempting to write on a read-only connection makes no sense to me. Arguably this behavior should be changed, or at the least the docs can be clarified to stress using
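As a rough illustration of the difference between the two configs above, the following Python sketch parses a connection string with the standard library and flags a multi-host string that does not request a read-write session. It assumes the parameter that matters is `target_session_attrs`; the helper names `describe_connection_string` and `needs_read_write_hint` are hypothetical and are not part of pomerium or libpq.

```python
from urllib.parse import parse_qs, urlsplit


def describe_connection_string(dsn):
    """Return (hosts, target_session_attrs) for a postgres:// URL."""
    parts = urlsplit(dsn)
    # Drop the "user:password@" prefix, then split the comma-separated hosts.
    host_part = parts.netloc.rpartition("@")[2]
    hosts = [h for h in host_part.split(",") if h]
    params = parse_qs(parts.query)
    target = params.get("target_session_attrs", [None])[0]
    return hosts, target


def needs_read_write_hint(dsn):
    """True when several hosts are listed without target_session_attrs=read-write."""
    hosts, target = describe_connection_string(dsn)
    return len(hosts) > 1 and target != "read-write"


working = ("postgres://pomerium:pomerium@pg-0,pg-1,pg-2/pomerium"
           "?sslmode=disable&target_session_attrs=read-write")
broken = ("postgres://customuser:custompassword@pg-0,pg-1,pg-2/pomerium"
          "?sslmode=disable")
```

Here `needs_read_write_hint(working)` is false and `needs_read_write_hint(broken)` is true, since the second string lists three hosts but would accept a read-only standby.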
Had git-3b2cc672 deployed during a massive series of network disruptions last night, and user sessions came back up beautifully. Many thanks!
Added todo about better docs
What happened?
Pomerium's databroker failed to switch to a new primary PSQL host after the existing primary was taken offline.
What did you expect to happen?
Per pomerium docs, the connection string follows `libpq`. Per libpq docs:
I configured the databroker with multiple hosts, and then took down the first host.
The databroker should have tried the next host (`pg-1`) in the PSQL connection string after `pg-0` was unreachable.
It should have connected, but noticed `pg-1` is missing the `read-write` session attribute.
It should then have tried the next host (`pg-2`), connected, confirmed the `read-write` session attribute, and continued functioning on the new primary.
How'd it happen?
I set up pomerium to use PSQL + `repmgr` as the databroker backend, with 3 nodes: `pg-0`, `pg-1`, `pg-2`.
I then killed `pg-0`.
`repmgr` correctly initiated a failover to a standby node `pg-2`, as verified by pgadmin.
Pomerium just errored out.
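The host walk described under "What did you expect to happen?" can be sketched as a small simulation. This is only an illustration of the expected selection order, not how the databroker or its driver is implemented; `pick_primary` and the `after_failover` state map are hypothetical.

```python
def pick_primary(hosts, cluster):
    """Return the first reachable host that accepts read-write sessions.

    cluster maps host name -> "down", "standby", or "primary" (simulated state).
    """
    for host in hosts:
        state = cluster.get(host, "down")
        if state == "down":
            continue  # unreachable: try the next host in the list
        if state == "primary":
            return host  # reachable and read-write: use it
        # Reachable standby: read-only, so keep looking.
    return None


# State after pg-0 was killed and repmgr promoted pg-2.
after_failover = {"pg-0": "down", "pg-1": "standby", "pg-2": "primary"}
```

Under these assumptions, `pick_primary(["pg-0", "pg-1", "pg-2"], after_failover)` skips `pg-0` (down) and `pg-1` (read-only) and settles on `pg-2`.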
What's your environment like?
Pomerium version (`pomerium --version`): 0.19.1
Docker Swarm (1 node): 20.10.18
What's your config.yaml?
What did you see in the logs?