Add PgBouncer peering support #666

Merged: 4 commits merged into pgbouncer:master from the peering branch on Apr 20, 2023

Conversation

@JelteF (Member) commented Jan 4, 2022

For both performance and availability reasons it can be useful to load
balance across multiple PgBouncer processes, possibly on different
machines. This is very easy to do: simply put a generic TCP load
balancer in front of the pgbouncer processes. The `so_reuseport`
option is one example of such a load balancer, which uses the OS kernel
to do the load balancing.

However, there is a problem with this simple approach: query
cancellations will not work most of the time. The reason is that a
cancellation opens another TCP connection, which might be handled by a
different pgbouncer process than the one that is processing the query
that should be cancelled. I talked about this problem in more detail at
the Uptime conference: https://www.youtube.com/watch?v=M585FfbboNA
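
For context on why the cancel lands on the wrong process: in the
PostgreSQL v3 wire protocol, a cancel request is a tiny message sent
over a brand-new connection, carrying nothing but the key identifying
the session to cancel. A minimal C sketch of its on-wire layout:

```c
#include <stdint.h>

/*
 * Layout of the PostgreSQL v3 CancelRequest message. It is the only
 * thing sent on a freshly opened connection, so a generic TCP load
 * balancer has no way to route it to the pgbouncer process that is
 * handling the session being cancelled. All fields are big-endian
 * on the wire.
 */
typedef struct CancelRequest {
	uint32_t len;         /* message length, always 16 */
	uint32_t code;        /* 80877102, the cancel request code */
	uint32_t backend_pid; /* first half of the 64-bit cancel key */
	uint32_t secret_key;  /* second half of the 64-bit cancel key */
} CancelRequest;
```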

This change adds support for pgbouncer peering. To enable peering, two
things need to be added to the configuration of each pgbouncer:

  1. A `peer_id` that is unique for each of the peered pgbouncers
  2. A new "peers" section with an entry for each of the other peers

When pgbouncer is configured like this, whenever it receives a
cancellation request for an unknown session, it forwards that request to
the peer that owns the session.
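
To make the forwarding rule concrete, here is a hypothetical C sketch of
that decision (all names invented for illustration, not pgbouncer's
actual code); the owning peer's id is recoverable from the cancel key,
as the review comments below explain:

```c
#include <stdint.h>

/* Hypothetical helpers, for illustration only */
extern uint16_t cf_peer_id;                        /* this process's peer_id */
extern int  local_cancel_key_exists(uint64_t key); /* did we mint this key?  */
extern void cancel_local_query(uint64_t key);
extern void forward_to_peer(uint16_t peer_id, uint64_t key);

/* Sketch of the routing rule: cancel locally if we own the session,
 * otherwise forward to the peer whose id is embedded in the key. */
static void handle_cancel_request(uint64_t key, uint16_t owner_peer_id)
{
	if (owner_peer_id == cf_peer_id) {
		if (local_cancel_key_exists(key))
			cancel_local_query(key);
		/* unknown key claiming to be ours: nothing to cancel */
	} else {
		forward_to_peer(owner_peer_id, key);
	}
}
```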

To configure three pgbouncers in peering mode using `so_reuseport`, you
can use these example configs.

pgbouncer1.ini

```ini
[databases]
postgres = host=localhost dbname=postgres

[pgbouncer]
peer_id=1
pool_mode=transaction
listen_addr=127.0.0.1
auth_type=trust
admin_users=jelte
auth_file=auth_file.conf
so_reuseport=1
unix_socket_dir=/tmp/pgbouncer1

[peers]
2 = host=/tmp/pgbouncer2
3 = host=/tmp/pgbouncer3
```

pgbouncer2.ini

```ini
[databases]
postgres = host=localhost dbname=postgres

[pgbouncer]
peer_id=2
pool_mode=transaction
listen_addr=127.0.0.1
auth_type=trust
admin_users=jelte
auth_file=auth_file.conf
so_reuseport=1
unix_socket_dir=/tmp/pgbouncer2

[peers]
1 = host=/tmp/pgbouncer1
3 = host=/tmp/pgbouncer3
```

pgbouncer3.ini

```ini
[databases]
postgres = host=localhost dbname=postgres

[pgbouncer]
peer_id=3
pool_mode=transaction
listen_addr=127.0.0.1
auth_type=trust
admin_users=jelte
auth_file=auth_file.conf
so_reuseport=1
unix_socket_dir=/tmp/pgbouncer3

[peers]
1 = host=/tmp/pgbouncer1
2 = host=/tmp/pgbouncer2
```

To try out the cancellation forwarding, simply run the following in
`psql`:

```sql
select pg_sleep(20);
```

and then press Ctrl+C.
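
If you'd rather trigger the cancellation programmatically, a minimal
libpq client does the same thing psql's Ctrl+C does: it opens a new
connection and sends back the cancel key it received at connection
start. (The connection string is illustrative; port 6432 assumes
pgbouncer's default listen_port, which the configs above leave unset.)

```c
#include <stdio.h>
#include <unistd.h>
#include <libpq-fe.h>

int main(void)
{
	/* Connect through one of the peered pgbouncers */
	PGconn *conn = PQconnectdb("host=127.0.0.1 port=6432 dbname=postgres user=jelte");
	if (PQstatus(conn) != CONNECTION_OK) {
		fprintf(stderr, "%s", PQerrorMessage(conn));
		return 1;
	}

	/* Start the long-running query without waiting for its result */
	PQsendQuery(conn, "select pg_sleep(20);");
	sleep(1);

	/* Sending the cancel opens a *new* TCP connection, which the
	 * load balancer may route to a different pgbouncer process */
	PGcancel *cancel = PQgetCancel(conn);
	if (cancel != NULL) {
		char errbuf[256];
		if (!PQcancel(cancel, errbuf, sizeof errbuf))
			fprintf(stderr, "cancel failed: %s\n", errbuf);
		PQfreeCancel(cancel);
	}

	/* Drain the (now cancelled) result */
	for (PGresult *res; (res = PQgetResult(conn)) != NULL;)
		PQclear(res);
	PQfinish(conn);
	return 0;
}
```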

Related to #345

@JelteF force-pushed the peering branch 3 times, most recently from c947da9 to 24075ce on January 4, 2022 14:58
@marcocitus commented Jan 5, 2022

> When pgbouncer is configured like this, then whenever it receives a cancellation request for an unknown session, it will forward it to the peer that owns this session.

Worth noting that the peer ID is encoded into the first 16 bits of the cancellation key sent to the client (which is normally a random 64-bit number), and the last bit is used to indicate whether the cancellation was forwarded. This reduces entropy somewhat, but it is still a lot higher than for a regular postgres server, which uses the first 32 bits to encode the (quite predictable, or known) PID.
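
In code, the layout described here could look roughly like this (a
hypothetical sketch of the scheme, not the actual pgbouncer
implementation):

```c
#include <stdint.h>

/*
 * Hypothetical sketch of the cancel-key layout described above: the
 * first 16 bits of an otherwise random 64-bit key carry the peer_id,
 * and the last (least significant) bit marks a forwarded request.
 */
static uint64_t make_cancel_key(uint16_t peer_id, uint64_t random_bits)
{
	uint64_t key = random_bits;
	key &= 0x0000ffffffffffffULL;   /* clear the top 16 bits...      */
	key |= (uint64_t)peer_id << 48; /* ...and store the peer id      */
	key &= ~(uint64_t)1;            /* forwarded flag starts cleared */
	return key;
}

static uint16_t peer_id_of_key(uint64_t key) { return key >> 48; }
static int key_was_forwarded(uint64_t key)   { return key & 1; }
```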

@JelteF (Member, Author) commented Jan 5, 2022

I updated various comments to make the code and the reasoning clearer, including one about the reduced entropy.

@JelteF force-pushed the peering branch 6 times, most recently from 4362edc to c15fa89 on December 5, 2022 07:44
@JelteF force-pushed the peering branch 6 times, most recently from b9ab2bb to 303e2ae on December 22, 2022 23:43
JelteF added a commit that referenced this pull request Jan 12, 2023
This changes our testing infrastructure to use Python instead of Bash.
Both `test/test.sh` and `test/ssl/test.sh` have been completely
converted to Python.

The main reason I want to introduce this change is that Bash
is one of the worst languages to write tests in. One of the main goals
of tests is to find unexpected errors, and Bash makes it very easy not
to notice such errors. This made it quite hard for me to write correct
tests and to review newly introduced tests in PRs for correctness.

Another big reason I want this change is that writing tests
for #666 requires running multiple pgbouncer processes at the
same time. The Bash infrastructure did not allow for this; the
new Python testing infrastructure does.

Finally, there are some other advantages that this new test
infrastructure brings:
1. Tests can be run in parallel, which makes them go **much** 
    faster. The duration of the test step in CI is now ~1 minute 
    for non-valgrind runs and ~2 minutes for valgrind runs. 
    With the bash testing infrastructure these timings were 
    5.5 minutes and ~8 minutes respectively.
2. Running tests with `valgrind` locally is now as simple as 
    `ENABLE_VALGRIND=1 pytest -n auto`
3. Some tests would not run on some OSes (e.g. SSL tests on
    Windows). I made sure almost all are now fully cross-platform
    (for the platforms we support in CI).
4. `USE_SUDO=1` tests are now run in CI on macOS and FreeBSD.
    (I confirmed `USE_SUDO=1` works with Linux on my local machine, 
    but was unable to get it working in CI due to `iptables` not being
    allowed in the CI containers).

How to run tests can be found in the updated `test/README.md` file.

PS. I haven't tested with OpenBSD, since I don't have access to such a
machine. I'm fairly confident that the tests will run fine though, since
they pass on all CI platforms (including macOS and FreeBSD). But if
anyone else has access to an OpenBSD box, it would be great if you
could test that the tests pass there as well.
@JelteF force-pushed the peering branch 2 times, most recently from 80e0e96 to bc1bb59 on January 16, 2023 16:04
@JelteF requested a review from eulerto on January 19, 2023 15:16
@marcocitus

> [peers]
> 1 = host=/tmp/pgbouncer1
> 2 = host=/tmp/pgbouncer2

Is there any downside when users include the peer itself? (to keep config consistent)

```c
free(peer->host);
peer->host = host;
peer->port = port;
peer->pool_size = pool_size;
```

Are there cases in which a pool_size > 1 is desirable? (Should that be disallowed?)

@JelteF (Member, Author)
Yes, I added user-facing docs (which were partially missing) to explain why. If there are bursts of cancel requests, you likely don't want to send those sequentially to the other process, so pool_size should almost always be larger than 1.

@JelteF force-pushed the peering branch 3 times, most recently from 060e07b to 5d38448 on February 14, 2023 11:15
@JelteF force-pushed the peering branch 3 times, most recently from bafa892 to 844ba88 on March 2, 2023 10:50
@marcocitus

I took another look at this, and it seems pretty solid. The only issue I ran into is that if the peer is unreachable, cancellation can take up to query_wait_timeout (2 minutes by default) to return, which seems a little unexpected, though that is also not specific to peering.

> so_reuseport=1
> unix_socket_dir=/tmp/pgbouncer1
> peer_id=1


I also had to add a pidfile to make the example work

@JelteF (Member, Author)

I double checked and it works fine without pidfile for me when running the two configs in two different shells like this:

```
./pgbouncer pgbouncer-peering1.ini
./pgbouncer pgbouncer-peering2.ini
```

I think it might have had to do with the fact that you didn't have /tmp/pgbouncer1 or /tmp/pgbouncer2 directories. I think that should be a hard error instead of a warning when using so_reuseport=1. But that's a separate issue that I'll open a new PR for.

JelteF and others added 4 commits April 20, 2023 15:14
Co-authored-by: Marco Slot <marco.slot@gmail.com>
@JelteF merged commit 8bdfc6f into pgbouncer:master Apr 20, 2023
JelteF added a commit to JelteF/pgbouncer that referenced this pull request Apr 21, 2023
This adds something to the docs that was forgotten in pgbouncer#666.
JelteF added a commit to JelteF/pgbouncer that referenced this pull request Apr 21, 2023
Cancellations and regular queries are quite different, so it makes sense
to have different (default) timeouts for cancellation requests than for
queries. While I already think our default query_wait_timeout of 120
seconds is excessively long, it makes no sense at all to have such a
long default timeout for cancellation requests that cannot be sent
onward. So this introduces a new `cancel_wait_timeout`, which is set to
10 seconds by default. If the cancellation request cannot be forwarded
within that time, it's probably better to drop it.

One reason why it's important to close unhandled cancellation requests
relatively quickly is that libpq only allows the client to send a cancel
in a blocking way. So there's no possibility to bail out after a certain
number of seconds on the client side if PgBouncer keeps the connection
to the client open. Both Marco and I noticed this when testing my
PgBouncer peering PR (pgbouncer#666), when an entry in the `[peers]`
list pointed to an unreachable host. That meant that pgbouncer would
keep the cancel connection open, and even though the query had finished,
`psql` would not display the result because it was stuck waiting for a
response to the cancel request.

To avoid race conditions, cancellation requests that have already been
forwarded but not yet answered are not timed out. See pgbouncer#717 for
details on such race conditions.
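
A hypothetical C sketch of that exemption (names and types invented for
illustration, not pgbouncer's actual code): only cancel requests still
waiting to be forwarded are subject to the timeout.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t usec_t;	/* microseconds, for illustration */

/* Hypothetical representation of a pending cancel request */
struct cancel_req {
	usec_t created_at;	/* when the client sent it        */
	bool   forwarded;	/* already sent to a peer/server? */
};

extern usec_t cf_cancel_wait_timeout;	/* e.g. 10 s, in microseconds */

/* Forwarded-but-unanswered requests are exempt from the timeout to
 * avoid the race conditions described in pgbouncer#717. */
static bool cancel_timed_out(const struct cancel_req *req, usec_t now)
{
	if (req->forwarded)
		return false;
	return now - req->created_at > cf_cancel_wait_timeout;
}
```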
JelteF added a commit to JelteF/pgbouncer that referenced this pull request Apr 24, 2023
JelteF added a commit that referenced this pull request Apr 24, 2023
JelteF added a commit that referenced this pull request Apr 25, 2023
@nijave mentioned this pull request Jul 24, 2023
JelteF added a commit to JelteF/pgbouncer that referenced this pull request Jul 25, 2023
This log line was introduced in pgbouncer#666, because that PR
significantly changed our cancellation logic. But when a client sends
many cancellations, this floods the logs with many identical log lines.
To avoid that, this lowers the log level to debug.

Also start using `slog_xxx` functions instead of `log_xxx` functions
for easier debugging in `forward_cancel_request`.

Fixes pgbouncer#901
JelteF added a commit that referenced this pull request Aug 9, 2023
@achanda commented Sep 6, 2023

@JelteF I am not sure I understand why this is needed. In your example pgbouncer config, all three pgbouncer instances talk to the same postgres host. In that case, the same postgres host will always receive the cancel request. I saw that your slides show two different postgres instances, each with a pgbouncer in front; that setup makes sense to me. But as far as I can tell, this is not useful when a bunch of different pgbouncer instances all talk to the same postgres instance. Did I misunderstand this?

@JelteF (Member, Author) commented Sep 6, 2023

@achanda Yes, you misunderstood. This peering feature is useful if there's a load balancer in front of these different pgbouncer instances (or if you're using so_reuseport=1, which effectively turns the Linux kernel into a load balancer). The reason is that the cancellation might end up at a different PgBouncer instance. This is a problem even when the backing postgres server is identical, because cancellation requests are not forwarded directly to Postgres. Instead, the PgBouncer instance maps the cancellation token it receives from the client to the specific client connection, and then sends the cancellation token of the server connection attached to that client on to the server. So it sends a different cancellation token to the server than it received from the client.

The client and server cancellation tokens cannot be the same, because when the client connects to PgBouncer, no server connection is assigned to it yet. And even if a connection were assigned instantly, in transaction pooling mode the client will likely use another server connection for the next transaction, and the protocol does not allow updating the cancellation token: it is static for the duration of the connection and is only sent during the start-up phase of the connection.
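
A rough C sketch of that translation step (hypothetical names and
structures, not pgbouncer's real ones): the bouncer maps the
client-facing key to the client, finds the server connection currently
attached to it, and sends that server's own key onward.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical structures, for illustration only */
struct server { uint64_t cancel_key; /* key postgres gave us */ };
struct client { uint64_t cancel_key; struct server *attached; };

extern struct client *lookup_client_by_cancel_key(uint64_t key);
extern void send_cancel_to_postgres(uint64_t server_key);

/*
 * Sketch of the translation: the key the client sends back is the one
 * pgbouncer minted for it, not a postgres key. pgbouncer maps it to
 * the client, then cancels via the attached server's own key.
 */
static void translate_and_forward_cancel(uint64_t client_key)
{
	struct client *c = lookup_client_by_cancel_key(client_key);
	if (c != NULL && c->attached != NULL)
		send_cancel_to_postgres(c->attached->cancel_key);
	/* otherwise: session unknown here, forward to the owning peer */
}
```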

@achanda commented Sep 6, 2023

Thanks for explaining, makes sense 👍
